# Reading and Writing JSON

JavaScript Object Notation (JSON) is a widely used data exchange format.  As the name suggests, it is a format derived from JavaScript, but it is strictly language neutral. JSON is currently specified by Internet Engineering Task Force (IETF) RFC 8259.  

JSON is supported by a great many programming languages, whether in their standard library, as built-ins, or through widely available third-party libraries.  Many JSON strings are also identical to valid Python expressions for some data structure or scalar.

Let us start out by loading a few Python standard library modules (and one external package) that this lesson will utilize.

In [23]:
import json
from pprint import pprint
from textwrap import fill
from dataclasses import dataclass, asdict
from datetime import datetime
from decimal import Decimal
from fractions import Fraction
from math import pi
import jsonpickle

## A String Representation

Let us create a dictionary and use the `json` module to serialize it in a string form. The examples in this lesson will largely follow those used in the lesson on Python pickles.

In [2]:
# Being still alive, lifespan is unknown & marked with NaN
my_data = dict(name="David", real_number=76.54, count=22, likes_python=True, 
               lifespan=float('nan'), end_of_time=float('inf'),
               pets=['Astrophe', 'Kachina', 'Jackson', 'Rebel'])

jstr = json.dumps(my_data)
print(fill(jstr, width=65))

{"name": "David", "real_number": 76.54, "count": 22,
"likes_python": true, "lifespan": NaN, "end_of_time": Infinity,
"pets": ["Astrophe", "Kachina", "Jackson", "Rebel"]}


In [3]:
print(fill(str(my_data), width=65))

{'name': 'David', 'real_number': 76.54, 'count': 22,
'likes_python': True, 'lifespan': nan, 'end_of_time': inf,
'pets': ['Astrophe', 'Kachina', 'Jackson', 'Rebel']}


## Almost Just Python

The JSON string representing the `my_data` dictionary is *almost* valid Python that we could copy-paste or `eval()`.  The main differences are the different spellings of `true` versus `True`, of `false` versus `False`, and of `null` versus `None`.

Another subtle issue occurred in the example, however.  The name `nan` is neither a Python keyword nor a built-in name, *nor* is it strictly part of the JSON spec.  This special class of floating-point values (Not-a-Number) is very useful for certain numeric purposes, so many JSON libraries add it as an informal extension.  

The JSON version is spelled `NaN`.  In Python, we could import the name `nan` from the `math` or `numpy` modules, or we can build it using the `float()` constructor.  Positive and negative infinity, which are part of the IEEE-754 floating-point standard, are likewise often useful.  Python's `json` module writes them as `Infinity` and `-Infinity`, but they are not part of JSON narrowly.

In [4]:
try:
    json.dumps(my_data, allow_nan=False)
except Exception as err:
    print(err)

Out of range float values are not JSON compliant
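
The same strictness can be applied when reading JSON produced elsewhere. The sketch below relies on the `parse_constant` argument of `json.loads()`, which the standard library calls only for the tokens `NaN`, `Infinity`, and `-Infinity`. The helper name `reject_constant` is made up for illustration.

```python
import json

def reject_constant(token):
    # Called by the decoder only for 'NaN', 'Infinity', and '-Infinity'
    raise ValueError(f"Non-standard JSON constant: {token}")

# By default the informal extension is accepted on input
lenient = json.loads('{"lifespan": NaN}')

# With the hook installed, the same document is refused
try:
    json.loads('{"lifespan": NaN}', parse_constant=reject_constant)
    strict_error = None
except ValueError as err:
    strict_error = str(err)

print(lenient)
print(strict_error)
```

Whether to refuse such input or map it to `None` is a design choice that depends on what the receiving system expects.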


## Same Values, Different Object

Serialization and deserialization will create an *equivalent* object, but not an identical object.  It should not be confused with a shared memory or concurrency mechanism (but serialization is a building block for *some* concurrency models).

In [5]:
# Avoid the NaN issue
my_data = dict(name="David", likes_python=True, count=None,
               pets=['Astrophe', 'Kachina', 'Jackson', 'Rebel'])

jstr = json.dumps(my_data)
new_data = json.loads(jstr)

In [6]:
print("Equality:", new_data == my_data)
print("Identity:", new_data is my_data)

Equality: True
Identity: False


## Serializing JSON to Files

The API of the `json` module generally matches that of `pickle`.  Along with `dumps()` and `loads()`, the `json` module also has `dump()` and `load()`.  In all of these, the final 's' is a very compact way of expressing the idea that the function consumes or produces *strings* rather than files.  That naming convention has a long history; most likely, newer functions that did not require backward compatibility would use more obvious names.

In contrast to the pickle format, which is *usually* used to save files with serialized objects, JSON is *usually* used to create an in-memory string to send over various wire protocols. 

In [7]:
with open('tmp/data.json', 'w') as fh:
    json.dump(my_data, fh)

!cat tmp/data.json

{"name": "David", "likes_python": true, "count": null, "pets": ["Astrophe", "Kachina", "Jackson", "Rebel"]}

## Reading Objects from Files

Reading JSON from a file, or from another file-like object, is exactly symmetrical with writing it.  With Python's so-called duck-typing, anything with a `.read()` method producing bytes allows unpickling.  Symmetrically, any object with a `.write()` method accepting bytes is suitable for pickling.  The `json` functions follow the same pattern, with one difference: `json.dump()` always writes `str`, so its target must accept text, while `json.load()` accepts a source that produces either text or bytes.  See examples in the previous lesson for use of several file-like objects.

In [8]:
# Use a context manager so the file handle is closed promptly
with open('tmp/data.json') as fh:
    loaded = json.load(fh)
loaded

{'name': 'David',
 'likes_python': True,
 'count': None,
 'pets': ['Astrophe', 'Kachina', 'Jackson', 'Rebel']}
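
As a minimal sketch of the duck-typing point, in-memory buffers from the standard `io` module can stand in for files on disk. The variable names here are illustrative only.

```python
import io
import json

record = {"name": "David", "count": None}

# json.dump() writes str, so a text buffer works as the "file"
text_buffer = io.StringIO()
json.dump(record, text_buffer)
text_buffer.seek(0)                      # rewind before reading
from_text = json.load(text_buffer)

# json.load() also accepts a binary source holding UTF-8 encoded JSON
byte_buffer = io.BytesIO(text_buffer.getvalue().encode("utf-8"))
from_bytes = json.load(byte_buffer)

print(from_text == record, from_bytes == record)
```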

# JSON Limitations

Only basic Python collections and scalars can be directly represented in JSON; however, these collections *can* be nested indefinitely.  Specifically, JSON allows for dictionaries (called "objects" in the spec) and lists (called "arrays" in the spec); JSON does not have a way of representing tuples, sets, `collections.deque`, `collections.Counter`, NumPy arrays, or other collections you might use in Python.  The keys for JSON objects may only be strings, unlike Python dictionaries that can use any hashable object. For many purposes, casting another collection to a list suffices to transmit the data.
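
A short sketch of how these limits show up in practice with the standard `json` module: tuples come back as lists, integer keys come back as strings, and a set has to be cast to a list before it can be written.

```python
import json

# Tuples are written as arrays, so they come back as lists
as_list = json.loads(json.dumps((1, 2, 3)))

# Integer keys are coerced to strings on the way out
str_keys = json.loads(json.dumps({1: "one", 2: "two"}))

# Sets are rejected outright; casting to a (sorted) list works
pets = {"Kachina", "Astrophe"}
try:
    json.dumps(pets)
    set_error = None
except TypeError as err:
    set_error = str(err)
pets_json = json.dumps(sorted(pets))

print(as_list, str_keys, set_error, pets_json, sep="\n")
```

Note that none of these conversions is reversed on loading; the receiver must know which lists were meant to be tuples or sets.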

The scalars supported by JSON are exclusively: the three literal names `true`, `false`, and `null`, strings, and numbers.  Strings are surrounded by double quotes, and may contain escaped Unicode code points.  JSON itself only contains a generic "number" datatype.  By default, numbers without a decimal point or exponent will be interpreted as Python ints.  Numbers with a decimal point or an exponent will be interpreted as Python floats.  Python allows other number types, such as `decimal.Decimal`, `fractions.Fraction`, or NumPy values of specific bit lengths.
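
The default number handling can be checked directly. This small sketch loads four numbers and reports the Python type chosen for each; the last value is 2**64, which Python keeps as an exact int even though many other JSON consumers would not.

```python
import json

values = json.loads('[10, 10.0, 1e2, 18446744073709551616]')

# A decimal point or an exponent selects float; otherwise int
types = [type(v).__name__ for v in values]
print(types)
```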

## Serialization Failures

Any custom classes, including ones that represent special scalars, will fail by default in JSON serialization.

In [9]:
timestamp = datetime.fromisoformat('2020-05-24T00:55:10')
try:
    json.dumps(timestamp)
except Exception as err:
    print(err)

Object of type datetime is not JSON serializable


In [10]:
decnum = Decimal('3.1415')
try:
    json.dumps(decnum)
except Exception as err:
    print(err)

Object of type Decimal is not JSON serializable


## Forcing Serialization

We can customize how special Python datatypes are serialized and deserialized.  This should be done with caution, however, because it can also impact interoperability with other systems.  This might mean systems in other programming languages, or it might simply be other Python machines without the same customizations.  First let us handle extra serialization.

In [24]:
class ScalarEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, (Decimal, Fraction)):
            return float(o)
        elif isinstance(o, datetime):
            return o.isoformat()
        else:
            return super().default(o)

### Semi-Generic Types

Let us encode some data using the custom encoder we developed.

In [12]:
nums = [timestamp, 42, decnum, pi, Fraction(22, 7)] 
jnums = json.dumps(nums, cls=ScalarEncoder)
pprint(jnums, width=55)

('["2020-05-24T00:55:10", 42, 3.1415, '
 '3.141592653589793, 3.142857142857143]')


This customization will not introduce much compatibility concern.  The same "number" can be represented in different systems.  However, notice that in the JSON representation absolutely nothing distinguishes the float, Decimal, and Fraction we started with as several approximations of the transcendental number pi.  The timestamp has simply become a string, but one that contains all the underlying information.

Reading this JSON serialization back in will work fine, but all non-integral numbers become platform-native floats.  We can change the default deserialization type for floats and ints if we would like to, but we can impose just one type for each of float and int.

In [13]:
json.loads(jnums)

['2020-05-24T00:55:10', 42, 3.1415, 3.141592653589793, 3.142857142857143]

In [14]:
json.loads(jnums, parse_float=Decimal, parse_int=Fraction)

['2020-05-24T00:55:10',
 Fraction(42, 1),
 Decimal('3.1415'),
 Decimal('3.141592653589793'),
 Decimal('3.142857142857143')]
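
The timestamp, by contrast, stays a string after loading. A minimal sketch of recovering it, assuming sender and receiver have agreed that the first element of the array is an ISO 8601 timestamp:

```python
import json
from datetime import datetime

jnums = '["2020-05-24T00:55:10", 42, 3.1415]'
raw = json.loads(jnums)

# Nothing in the JSON marks this string as a timestamp;
# we convert it only because we know its position by convention
restored = [datetime.fromisoformat(raw[0]), *raw[1:]]
print(restored)
```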

## Customizing Serialization

For complex objects, the `.__dict__` of the object often serves as a reasonable proxy for "the interesting data" inside the object. We saw a definition of a custom encoder and could enhance it to deal with additional types that way. However, this is about the point where you want to worry more about the actual utility of your serialization, especially if you will transmit it to other systems (e.g. ones running different programming languages).  

In [15]:
class RobustEncoder(ScalarEncoder):
    def default(self, o):
        try:
            return super().default(o)
        except TypeError:
            return o.__dict__

Let us create a custom instance that has some "problem" nested data, and serialize it using this new encoder.

In [16]:
@dataclass
class TestData:
    description: str
    timestamp: datetime
    numbers: list

In [17]:
test_data = TestData(description="Pi approximations",
                     timestamp=timestamp,
                     numbers=[decnum, pi, Fraction(22, 7)])
pprint(str(test_data), width=56)

("TestData(description='Pi approximations', "
 'timestamp=datetime.datetime(2020, 5, 24, 0, 55, 10), '
 "numbers=[Decimal('3.1415'), 3.141592653589793, "
 'Fraction(22, 7)])')


At this point we are able to serialize to JSON a custom class, albeit without specifically maintaining any information about the class it belongs to, only the underlying data.

In [18]:
pprint(json.dumps(test_data, cls=RobustEncoder))

('{"description": "Pi approximations", "timestamp": "2020-05-24T00:55:10", '
 '"numbers": [3.1415, 3.141592653589793, 3.142857142857143]}')


The example used a Data Class, but that was only because of the compact form of its definition.  The same example would work for any custom class.
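
As a sketch of that claim, here is an ordinary (non-dataclass) class serialized the same way. Instead of subclassing `json.JSONEncoder`, it uses the `default=` argument of `json.dumps()`, which is a lighter-weight route to the same fallback. The `Point` class and the `attrs_only` helper are hypothetical names used only for this illustration.

```python
import json

class Point:
    # An ordinary class, not a dataclass
    def __init__(self, x, y):
        self.x = x
        self.y = y

def attrs_only(o):
    # Fall back to the instance attributes; the class name is lost
    return vars(o)

jpoint = json.dumps(Point(1, 2), default=attrs_only)
print(jpoint)
```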

# JSON Pickles

If your concern for interoperability is low, and you only wish to exchange data between reasonably similarly configured Python systems (or only persist objects on the same system), the third-party module `jsonpickle` does this abstraction for you.  This achieves round-tripping, which is often useful.  Its capabilities and limitations are essentially identical to `pickle` itself.  However, the binary pickle format is considerably more compact than the JSON string format.

In [25]:
jpkl = jsonpickle.encode(test_data, indent=True)
new_data = jsonpickle.decode(jpkl)
pprint(str(new_data), width=56)

("TestData(description='Pi approximations', "
 'timestamp=datetime.datetime(2020, 5, 24, 0, 55, 10), '
 "numbers=[Decimal('3.1415'), 3.141592653589793, "
 'Fraction(22, 7)])')


The various nested datatypes are fully preserved, as well as the class they belong to.

## The Verbose Format

Above, I used the `indent=True` option to produce more human-readable (but somewhat larger) JSON output.  It only modifies semantically meaningless whitespace.  The same switch exists on the `json` module itself.  Let us look at what is contained in this specialized JSON format. We will use several slides to see the parts.

In [20]:
lines = jpkl.splitlines()
print('\n'.join(lines[:14]))

{
 "py/object": "__main__.TestData",
 "description": "Pi approximations",
 "timestamp": {
  "py/object": "datetime.datetime",
  "__reduce__": [
   {
    "py/type": "datetime.datetime"
   },
   [
    "B+QFGAA3CgAAAA=="
   ]
  ]
 },


In [21]:
print('\n'.join(lines[14:28]))

 "numbers": [
  {
   "py/reduce": [
    {
     "py/type": "decimal.Decimal"
    },
    {
     "py/tuple": [
      "3.1415"
     ]
    }
   ]
  },
  3.141592653589793,


In [22]:
print('\n'.join(lines[28:]))

  {
   "py/reduce": [
    {
     "py/type": "fractions.Fraction"
    },
    {
     "py/tuple": [
      "22/7"
     ]
    }
   ]
  }
 ]
}
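
The core idea visible in this output, a tag naming the class stored alongside the data, can be imitated with the standard library alone. The following is only a rough sketch of the technique, not how `jsonpickle` is implemented: the `"__class__"` key, the `REGISTRY` mapping, and the `Version` class are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Version:
    major: int
    minor: int

# Only classes listed here will be reconstructed on loading
REGISTRY = {"Version": Version}

def encode(o):
    # Hypothetical "__class__" tag, playing the role of "py/object"
    return {"__class__": type(o).__name__, **asdict(o)}

def decode(d):
    # Called for every decoded JSON object, innermost first
    cls = REGISTRY.get(d.get("__class__"))
    if cls is None:
        return d
    fields = {k: v for k, v in d.items() if k != "__class__"}
    return cls(**fields)

jstr = json.dumps(Version(3, 12), default=encode)
version = json.loads(jstr, object_hook=decode)
print(jstr)
print(version)
```

Restricting reconstruction to an explicit registry keeps the decoder from instantiating arbitrary classes named in untrusted input.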
