[spec/ysh-json] Failing test cases for more errors
[doc] Document J8 notation more

- high level description
- make note of errors
Andy C committed Jan 5, 2024
1 parent 92cb228 commit 093fa7e
Showing 5 changed files with 248 additions and 53 deletions.
131 changes: 112 additions & 19 deletions doc/j8-notation.md
@@ -7,15 +7,76 @@ J8 Notation
===========

J8 Notation is a set of interchange formats for **Bytes, Strings, Records, and
Tables**. It's built on JSON, and compatible with it in many ways.
Tables**. It's built on [JSON]($xref), and compatible with it in many ways.

See [ref/index-data.html](ref/index-data.html) for the reference.
It was designed for Oils, but it is **not** specific to Oils, just as JSON isn't
specific to JavaScript. Today a Python program and a Go program may communicate
with [JSON]($xref), and JavaScript isn't involved at all.


<div id="toc">
</div>

## Goals

- Fix the JSON-Unix mismatch: be able to express byte strings.
- Note: you can still use plain JSON in Oils if **lossy** encodings are OK!
- Provide an option to avoid the Surrogate Pair / UTF-16 legacy of JSON
- Expose some information about strings vs. bytes
- Turn TSV into an **exterior** [data
frame](https://www.oilshell.org/blog/2018/11/30.html) format.
- It can represent tabs! And binary data.

Non-goals:

- "Replace" JSON. It's upward compatible.
- Resolve the strings vs. bytes dilemma in all situations.

## J8 Notation in As Few Words As Possible

J8 Strings are a superset of JSON strings:

Only valid unicode:

<pre style="font-size: x-large;">
u'hi &#x1f926; \u{1f926}' &rarr; hi &#x1f926; &#x1f926;
</pre>

JSON: unicode + surrogate halves:

<pre style="font-size: x-large;">
"hi &#x1f926; \ud83e\udd26" &rarr; hi &#x1f926; &#x1f926;
"\ud83e"
</pre>

Any byte string:

<pre style="font-size: x-large;">
b'hi &#x1f926; \u{1f926} \yf0\y9f\ya4\ya6' &rarr; hi &#x1f926; &#x1f926; &#x1f926;
b'\yff'
</pre>

JSON8 is built on top of J8 strings, as well as:

1. Unquoted object/Dict keys `{d: 42}`
1. Trailing commas `{"d": 42,}` and `[42,]`
1. Single-line comments `//` and `#`

TSV8:

1. Required first row with column names
1. Optional second row with column types
1. Gutter Column

## Background

See [ref/toc-data.html](ref/toc-data.html) for the reference.

<!-- TODO: fix CSS -->

It's available in OSH and YSH, but should be implemented in all languages.

TODO:
### TODO / Diagrams

- Doc: How to Turn a JSON library encoder into a J8 Notation library. (Issue:
byte strings vs. unicode strings. J8 is more expressive.)
@@ -36,19 +97,9 @@ TODO:
- YSH relationships
- Every J8 string is valid in YSH, with `u''`

## Review of JSON

See <https://json.org>

```
[primitive] null true false
[number] 42 -1.2e-4
[string] "hello\n", see J8 Strings
[array] [1, 2, 3]
[object] {"key": 42}
```
## Strings and Bytes

### JSON Strings
### Review of JSON Strings

```
[escaped] \" \\ \/ \b \f \n \r \t
@@ -59,7 +110,8 @@ See <https://json.org>

TODO: Do we need JNUM as a name for JSON numbers?

## J8 string - Byte strings which may be UTF-8 encoded
### J8 strings - Byte strings which may be UTF-8 encoded


```
[unicode] \u{123456} to add UTF-8. No surrogates.
@@ -83,7 +135,21 @@ Distinguished form:

- The `j` prefix is always present.

## JSON8 - Records built on J8 string
## Tree-Shaped Records

### Review of JSON

See <https://json.org>

```
[primitive] null true false
[number] 42 -1.2e-4
[string] "hello\n", see J8 Strings
[array] [1, 2, 3]
[object] {"key": 42}
```

### JSON8 - Records built on J8 strings

Examples:

@@ -136,7 +202,9 @@ Canonical form? The shortest form?

- Keys aren't quoted?

## Review of TSV
## Table-Shaped Textual Data

### Review of TSV

See RFC (TODO)

@@ -155,7 +223,7 @@ Restrictions:
- There's no escaping, so unprintable bytes result in an unprintable TSV file.


## TSV8 - Tables built on J8 string
### TSV8 - Tables built on J8 strings

Example:

@@ -206,3 +274,28 @@ Canonical form:

TSV8 is always distinguished by leading `!tsv8`.


## FAQ

### Why are byte escapes spelled `\yff` and not `\xff` like C?

Because the JavaScript and Python languages both overload `\xff` to mean
`\u{ff}`.

TODO: example

This is exactly the confusion that J8 notation sets out to fix, so we choose to
be ultra **explicit** and different.
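
One way to fill in the TODO above is the following Python snippet. It only illustrates Python's own string literals and is not part of any J8 implementation; the variable names are arbitrary.

```python
# In a Python str literal, \xff names the code point U+00FF (same as \u00ff),
# not the byte 0xff.
s = '\xff'
assert s == '\u00ff'
print(s.encode('utf-8'))   # b'\xc3\xbf', which is two bytes

# Only in a bytes literal does \xff mean the single byte 0xff.
b = b'\xff'
print(len(b))              # 1
```

In J8 notation the two meanings get different spellings: `\yff` is always a byte, and `\u{ff}` is always a code point.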

### How Do I Write a J8 Encoder or Decoder?

The list of errors at [ref/chap-errors.html](ref/chap-errors.html) may be a
good starting point.
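
As a rough illustration, an encoder could be sketched in Python like this. `encode_j8` and `SIMPLE_ESCAPES` are hypothetical names, and the set of simple escapes handled here is an assumption rather than the full J8 list; the reference remains the authority.

```python
# Assumed subset of simple escapes; not the complete J8 list.
SIMPLE_ESCAPES = {'\\': '\\\\', "'": "\\'", '\n': '\\n', '\r': '\\r', '\t': '\\t'}


def encode_j8(data: bytes) -> str:
    """Encode bytes as u'...' if they are valid UTF-8, else as b'...'."""
    try:
        text = data.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8: use a b'' string and spell every byte outside
        # printable ASCII as a \yHH escape.
        parts = []
        for byte in data:
            ch = chr(byte)
            if ch in SIMPLE_ESCAPES:
                parts.append(SIMPLE_ESCAPES[ch])
            elif 0x20 <= byte < 0x7f:
                parts.append(ch)
            else:
                parts.append('\\y%02x' % byte)
        return "b'" + ''.join(parts) + "'"

    # Valid UTF-8: use a u'' string and escape only control characters.
    parts = []
    for ch in text:
        if ch in SIMPLE_ESCAPES:
            parts.append(SIMPLE_ESCAPES[ch])
        elif ord(ch) < 0x20 or ord(ch) == 0x7f:
            parts.append('\\u{%x}' % ord(ch))
        else:
            parts.append(ch)
    return "u'" + ''.join(parts) + "'"


print(encode_j8(b'hi'))       # u'hi'
print(encode_j8(b'a\n\xff'))  # b'a\n\yff'
```

This sketch always picks the `u''` form when it can, so a reader of the output learns whether the data was valid UTF-8. Other policies are possible.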

## Future Work

We could have an SEXP8 format:

- Concrete syntax trees
- with location information
- Textual IRs like WebAssembly
41 changes: 33 additions & 8 deletions doc/ref/chap-errors.md
@@ -33,7 +33,12 @@ JSON encoding has three possible errors:

### json-decode-err

TODO
1. The encoded message itself is not valid UTF-8.
- (Typically, you need to check the unescaped bytes in string literals
`"abc\n"`).
1. Lexical error, like the message `+`, an invalid escape `"\z"`, or a
truncated escape `"\u1"`.
1. Grammatical error, like the message `}{`.
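
To see these cases concretely, here is a small sketch using Python's standard `json` module as a stand-in decoder. `classify` is a hypothetical helper. Python's decoder reports lexical and grammatical errors with the same exception type, so the sketch can't tell errors 2 and 3 apart.

```python
import json


def classify(message: bytes) -> str:
    """Return 'ok', 'utf8', or 'syntax' for an encoded JSON message."""
    try:
        text = message.decode('utf-8')
    except UnicodeDecodeError:
        return 'utf8'      # error 1: the message is not valid UTF-8
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return 'syntax'    # error 2 or 3: lexical or grammatical
    return 'ok'


for msg in [b'"abc\\n"', b'"\xff"', b'+', b'"\\z"', b'"\\u1"', b'}{']:
    print(msg, classify(msg))
```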

## JSON8

@@ -46,7 +51,11 @@ Compared to JSON, JSON8 removes an encoding error:

### json8-decode-err

TODO
JSON8 has the same decoding errors as JSON, plus:

4. `\u{dc00}` should not be in the surrogate range. This means it doesn't
represent a real character, and `\yff` escapes should be used instead.
4. `\yff` should not be in a `u''` string. (It's only valid in `b''` strings.)
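
The two extra checks could look roughly like this in a decoder. This is a minimal sketch, not the Oils implementation: `decode_escapes` is a hypothetical helper that handles only `\\`, `\u{...}`, and `\yHH`, and takes the string body and its prefix separately.

```python
import re

# Matches \\ , \u{...} and \yHH only; other escapes are out of scope here.
ESCAPE = re.compile(r'\\(?:\\|u\{([0-9a-fA-F]{1,6})\}|y([0-9a-fA-F]{2}))')


def decode_escapes(body: str, prefix: str) -> bytes:
    """Decode the body of a u'' or b'' string into bytes."""
    out = bytearray()
    pos = 0
    for m in ESCAPE.finditer(body):
        out += body[pos:m.start()].encode('utf-8')
        if m.group(1) is not None:
            code_point = int(m.group(1), 16)
            if 0xD800 <= code_point <= 0xDFFF:
                raise ValueError('\\u{%x} is in the surrogate range' % code_point)
            if code_point > 0x10FFFF:
                raise ValueError('\\u{%x} is too large' % code_point)
            out += chr(code_point).encode('utf-8')
        elif m.group(2) is not None:
            if prefix != 'b':
                raise ValueError("\\y escapes are only valid in b'' strings")
            out.append(int(m.group(2), 16))
        else:
            out += b'\\'   # an escaped backslash
        pos = m.end()
    out += body[pos:].encode('utf-8')
    return bytes(out)


print(decode_escapes(r'\u{3bc}', 'u'))   # b'\xce\xbc'
print(decode_escapes(r'\yff', 'b'))      # b'\xff'
```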

## Packle

@@ -71,17 +80,33 @@ TODO

This is for reference.

### bad-byte
### utf8-encode-err

Oils stores strings as UTF-8 in memory, so it doesn't often do encoding.

- Surrogate range?

### utf8-decode-err

### expected-start
#### bad-byte

### expected-cont
#### expected-start

### incomplete-seq
#### expected-cont

### overlong
#### incomplete-seq

### bad-code-point
#### overlong

I think this is only leading zeros?

Like the difference between `123` and `0123`.

#### bad-code-point

e.g. decoded to something in the surrogate range

Note: I think this is relaxed for WTF-8, and our JSON decoder probably needs to
use it.
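
For reference, here is one sample byte sequence per category, checked against Python's strict UTF-8 decoder. The mapping from the names above to these samples is an assumption made for illustration, and `SAMPLES` and `is_valid_utf8` are hypothetical names. The overlong sample supports the "leading zeros" reading: it encodes `/` (U+002F) in two bytes instead of one.

```python
# Hypothetical mapping from the error names above to sample byte sequences.
SAMPLES = {
    'bad-byte':       b'\xff',          # 0xff never appears in UTF-8
    'expected-start': b'\x80',          # continuation byte with no start byte
    'expected-cont':  b'\xc3\x28',      # start byte, then a non-continuation
    'incomplete-seq': b'\xe2\x82',      # 3-byte sequence cut off after 2 bytes
    'overlong':       b'\xc0\xaf',      # '/' (U+002F) padded out to 2 bytes
    'bad-code-point': b'\xed\xa0\x80',  # U+D800, a surrogate
}


def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


for name, sample in SAMPLES.items():
    print(name, is_valid_utf8(sample))   # every sample is rejected

# Python's 'surrogatepass' error handler accepts encoded surrogates,
# which is similar in spirit to WTF-8.
print(b'\xed\xa0\x80'.decode('utf-8', 'surrogatepass') == '\ud800')   # True
```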


78 changes: 58 additions & 20 deletions doc/ref/chap-j8.md
@@ -57,7 +57,12 @@ See the [Surrogate Pair Blog
Post](https://www.oilshell.org/blog/2023/06/surrogate-pair.html) for an
example:

\ud83e\udd26
"\ud83e\udd26"

Because JSON strings are valid J8 strings, surrogate pairs are also part of J8
notation. Decoders must accept them, but encoders should avoid them.

You can emit `u'\u{1f926}'` or `b'\u{1f926}'` instead of `"\ud83e\udd26"`.
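
A short Python sketch of that advice, using the standard `json` module as the decoder. The variable names are arbitrary, and the `u''` output is built by hand because Python has no J8 encoder.

```python
import json

# A JSON decoder combines the escaped surrogate pair into one code point.
face = json.loads('"\\ud83e\\udd26"')
assert face == '\U0001f926'

# An encoder can then emit the code point directly, J8 style.
j8 = "u'\\u{%x}'" % ord(face)
print(j8)   # u'\u{1f926}'

# A lone surrogate half also decodes, but the result can't be encoded as UTF-8.
half = json.loads('"\\ud83e"')
try:
    half.encode('utf-8')
except UnicodeEncodeError:
    print('lone surrogate is not valid UTF-8')
```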

<h3 id="u-prefix">u-prefix <code>u'hi'</code></h3>

@@ -72,8 +77,8 @@ because they may contain surrogate halves.
In contrast, `u''` strings can only have escapes like `\u{1f926}`, with no
surrogate pairs or halves.

- The **encoded** bytes must also be valid UTF-8, like JSON strings.
- The decoded bytes are valid UTF-8, **unlike** JSON strings.
- The **encoded** bytes must be valid UTF-8, like JSON strings.
- The **decoded** bytes must be valid UTF-8, **unlike** JSON strings.

Escaping:

@@ -106,6 +111,16 @@ To summarize, the valid J8 escapes are:
\yff # only valid in b'' strings
\u{3bc} \u{1f926} etc.

<h3 id="no-prefix">no-prefix <code>'hi'</code></h3>

Single-quoted strings without a `u` or `b` prefix are implicitly `u''`.

u'hi μ \n'
'hi μ \n' # same as above, no \yff escapes accepted

They should be avoided in contexts where `""` strings may also appear, because
it's easy to confuse single quotes and double quotes.

## JSON8

JSON8 is JSON with 4 rules:
@@ -115,36 +130,56 @@ JSON8 is JSON with 4 rules:
- trailing commas
- comments?

### Null
### json8-num

Decoding detail, specific to Oils:

Expressed as the 4 letters `null`.
If there's a decimal point or `e-10` suffix, then it's decoded into YSH
`Float`. Otherwise it's a YSH `Int`.

### Bool
42 # decoded to Int
42.0 # decoded to Float
42e1 # decoded to Float
42.0e1 # decoded to Float
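
Python's standard `json` module happens to follow the same rule, which makes for a quick comparison. This snippet shows Python's behavior, not Oils's:

```python
import json

# Integers stay int; a decimal point or exponent produces a float.
for text in ['42', '42.0', '42e1', '42.0e1']:
    value = json.loads(text)
    print(text, type(value).__name__)
```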

Either `true` or `false`.
### json8-str

JSON8 strings are exactly J8 strings:

### Number
<pre>
"hi &#x1f926; \u03bc"
u'hi &#x1f926; \u{3bc}'
b'hi &#x1f926; \u{3bc} \yff'
</pre>

See JSON grammar.
### json8-list

If there is a decimal point or `e-10` suffix, then it's decoded into YSH float.
Like JSON lists, but can have a trailing comma. Examples:

### Json8String
[42, 43]
[42, 43,] # same as above

It's one of 3 types:
### json8-dict

- JSON string
- B string (bytes)
- J string (unicode)
Like JSON "objects", but:

### List
- Can have a trailing comma.
- Can have unquoted keys, as long as each key is an identifier.

Known as `array` in JSON
Examples:

{"json8": "message"}
{json8: "message"} # same as above
{json8: "message",} # same as above

### json8-comment

End-of-line comments in two styles:

### Dict
{"json8": "message"} // comment

{"json8": "message"} # comment

Known as `object` in JSON

## TSV8

@@ -164,11 +199,14 @@ These are the J8 Primitives (Bool, Int, Float, Str), separated by tabs.

The primitives:

- Null
- Bool
- Int
- Float
- Str

Note: Can `null` be in all cells? Maybe except `Bool`?

It can stand in for `NA`?

[JSON]: https://json.org
