# JSON and serialization
By the end of this lecture you will be able to:
- read and write a `DataFrame` to JSON and newline delimited JSON
- do lazy scans of newline delimited JSON
- serialize and deserialize a `DataFrame`

A newline delimited JSON (NDJSON) file has one valid JSON object per line. You can learn more about newline delimited JSON here: https://medium.com/@kandros/newline-delimited-json-is-awesome-8f6259ed4b4b
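
As a small illustration using only Python's standard `json` module (the records are made up for this example), the same three rows can be written as a single JSON array or as newline delimited JSON, where each line parses on its own:

```python
import json

# Hypothetical records used only for illustration
records = [
    {"id": 1, "values": "a"},
    {"id": 2, "values": "b"},
    {"id": 3, "values": None},
]

# Standard JSON: one document holding an array of objects
json_text = json.dumps(records)

# Newline delimited JSON: one complete JSON object per line
ndjson_text = "\n".join(json.dumps(record) for record in records)

# Each NDJSON line can be parsed independently of the others
parsed_lines = [json.loads(line) for line in ndjson_text.splitlines()]
print(ndjson_text)
```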

JSON can represent tabular data as a string and is often used for sending data over a network. In this notebook we see how to parse JSON with Polars and then look at an alternative, more robust way to convert a `DataFrame` to a string representation.

In [None]:
from pathlib import Path
from collections import OrderedDict

import polars as pl

We create the following valid JSON string with three rows of data. Note that we prefix the string literal with `b` so that Python creates a bytes object rather than a `str`.

We use bytes because `pl.read_json` interprets a string input as a path to a file, whereas it parses a bytes input as JSON content.

In [None]:
jsonString = b"""
    [
        {"id":1,"values":"a"},
        {"id":2,"values":"b"},
        {"id":3,"values":null}
    ]
"""

In [None]:
(
    pl.read_json(jsonString)
)

Note that if you receive a JSON string (from an HTTP response, for example), you can convert it to bytes with Python's built-in `bytes` function and the appropriate encoding (UTF-8 in this example) so that Polars can read it.

In [None]:
pl.read_json(
    bytes("""
        [
            {"id":1,"values":"a"},
            {"id":2,"values":"b"},
            {"id":3,"values":null}
        ]
    """,
        "utf-8"
         )
)

We can specify the schema of the output with the `schema` argument. Here we pass a `dict` that maps a column name to its dtype.

In [None]:
(
    pl.read_json(jsonString,schema={"id":pl.Int64})
)

Or we can pass an `OrderedDict`

In [None]:
(
    pl.read_json(jsonString,schema=OrderedDict([("id",pl.Int64)]))
)

As the data in `jsonString` does not specify the dtypes of the columns, Polars tries to infer them from the first rows. We can control how many rows Polars uses to infer the dtypes with the `infer_schema_length` argument.

In [None]:
(
    pl.read_json(jsonString,infer_schema_length=2)
)

## Writing JSON
We can write a `DataFrame` to a JSON string with `write_json`.

In [None]:
df = pl.read_json(jsonString)
df.write_json()

Be aware that we can lose dtype information by writing to and then reading JSON. Here, for example, we cast the `id` column to 8-bit integer before writing to JSON but the final `DataFrame` has a 64-bit integer `id` column. We see an alternative approach to deal with this below.

In [None]:
json_output = (
    pl.read_json(jsonString)
    .with_columns(
        pl.col("id").cast(pl.Int8)
    )
    .write_json()
)
pl.read_json(bytes(json_output,"utf-8"))

## Nested data

JSON may contain arbitrarily nested structures. Polars tries to cast these nested structures to its own nested dtypes. 

### Nested key-value pairs
Polars converts the key-value pairs in `values` to a `pl.Struct` dtype *if the types in the nested structure are consistent*. Otherwise an `Exception` is raised.

In [None]:
nestedJsonString = b"""
    [
        {"id":1,"values":{"a":0,"b":1}},
        {"id":2,"values":{"a":0,"b":1}},
        {"id":3,"values":null}
    ]
"""

In [None]:
(
    pl.read_json(nestedJsonString)
)

### Nested arrays
Polars attempts to convert arrays to a `pl.List` dtype

In [None]:
nestedArrayJsonString = b"""
    [
        {"id":1,"values":[0,1]},
        {"id":2,"values":[0,1.0]}
    ]
"""

In [None]:
(
    pl.read_json(nestedArrayJsonString)
)

## Newline delimited JSON
In a similar way to standard JSON we can read newline delimited JSON with `pl.read_ndjson`

In [None]:
newlineDelimitedJsonString = b"""
        {"id":1,"values":"a"}\n
        {"id":2,"values":"b"}\n
        {"id":3,"values":null}\n
"""

In [None]:
(
    pl.read_ndjson(newlineDelimitedJsonString)
)

As with standard JSON we can pass a `schema` and set `infer_schema_length`.

We can also do lazy scans of newline-delimited JSON. To show this we must first create a directory to hold an example newline-delimited JSON file.

In [None]:
# Specify a directory to hold the ndJSON file
ndjson_dir = Path('data_files/ndjson')
ndjson_file = "example.json"
# Create the ndjson sub-directory if it doesn't exist already
ndjson_dir.mkdir(parents=True,exist_ok=True)
# Set the path to the ndJSON file
ndjson_path = ndjson_dir / ndjson_file

Now we create a `DataFrame` from the example above and write it to the example file

In [None]:
(
    pl.read_ndjson(newlineDelimitedJsonString)
    .write_ndjson(ndjson_path)
)

We can now start a lazy query by scanning the ndJSON file

In [None]:
print(
    pl.scan_ndjson(ndjson_path)
    .select("id")
    .explain()
)

At present `pl.scan_ndjson` does not run in streaming mode. We can see this because the query plan below has no `STREAMING` section.

In [None]:
print(
    pl.scan_ndjson(ndjson_path)
    .select("id")
    .explain(streaming=True)
)

## Serialization
Polars has an alternative way to serialize a `DataFrame` to a string representation. First we create our `DataFrame`

In [None]:
df = pl.read_json(jsonString)

We then call `serialize` to get the JSON string representation.

In [None]:
(
    df
    .serialize(format="json")
)

In the output we see that the `datatype` for each column is specified. We can pass a file path to `serialize` to write to a file instead of outputting a string.

To convert the serialized output back to a `DataFrame` we call `pl.DataFrame.deserialize` on the file or on the string (encoded as bytes in the case of a JSON string).

In [None]:
(
    pl.DataFrame.deserialize(
        bytes(
            df
            .cast(
                {
                    "id":pl.Int8
                }
            )
            .serialize(format="json"),
            "utf-8"
        ),
        format="json"
    )
)

Above we passed the `format="json"` argument to get human-readable JSON output. By default, however, `serialize` returns a binary (bytes) representation of the `DataFrame`.

In [None]:
(
    df
    .serialize()
)

We can re-create the `DataFrame` by passing this output as an `io.BytesIO` object using Python's built-in `io` module.

In [None]:
import io

(
    pl.DataFrame.deserialize(
        io.BytesIO(df.serialize())
    )
)