[doc/ref] Document J8 strings
Andy C committed Jan 5, 2024
1 parent 643a133 commit 92cb228
Showing 3 changed files with 85 additions and 26 deletions.
1 change: 1 addition & 0 deletions build/doc.sh
@@ -96,6 +96,7 @@ readonly MARKDOWN_DOCS=(
# Data language
qsn
qtt
j8-notation

doc-toolchain
doc-plugins
101 changes: 79 additions & 22 deletions doc/ref/chap-j8.md
@@ -11,52 +11,109 @@ JSON / J8 Notation
This chapter in the [Oils Reference](index.html) describes [JSON]($xref), and
its **J8 Notation** superset.

This is a quick reference, not the official spec.
See the [J8 Notation](../j8-notation.html) doc for more background. This doc
is a quick reference, not the official spec.

<div id="toc">
</div>


## J8 Strings

J8 strings are an upgrade of JSON strings that solve the *JSON-Unix Mismatch*.

That is, Unix deals with byte strings, but JSON can't represent byte strings.
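
A minimal Python sketch of the mismatch, using only the standard `json` module. The helper name `json_encode_bytes` and the sample file name are assumptions made for illustration. The byte `0xff` can appear in a Unix file name, but it is never valid UTF-8, so neither obvious route to a JSON string works:

```python
import json

# A byte string such as a Unix file name (made-up value).
raw = b"file-\xff.txt"

def json_encode_bytes(data: bytes) -> str:
    """Try the two obvious routes to JSON; report when neither works."""
    try:
        return json.dumps(data)  # bytes are not a JSON type
    except TypeError:
        pass
    try:
        return json.dumps(data.decode("utf-8"))  # needs valid UTF-8
    except UnicodeDecodeError:
        return "not representable"

print(json_encode_bytes(b"hi"))
print(json_encode_bytes(raw))
```

The first call succeeds only because `b"hi"` happens to be valid UTF-8; the second has no faithful JSON form.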

<h3 id="json-string">json-string <code>"hi"</code></h3>

All JSON strings are valid J8 strings!

This is important. Encoders often emit JSON-style `""` strings rather than
`u''` or `b''` strings.

Example:

"hi μ \n"

<h3 id="json-escape">json-escape <code>\" \n \u1234</code></h3>

- `\" \\`
- `\b \f \n \r \t`
- `\u1234`
As a reminder, the backslash escapes valid in [JSON]($xref) strings are:

\" \\
\b \f \n \r \t
\u1234

Additional J8 escapes are valid in `u''` and `b''` strings, described below.

### surrogate-pair
<h3 id="surrogate-pair">surrogate-pair <code>\ud83e\udd26</code></h3>

Inherited from JSON
JSON's `\u1234` escapes can't represent code points at or above `U+10000`
(2<sup>16</sup>), so JSON also has a "surrogate pair hack".

See [Surrogate Pair Blog
Post](https://www.oilshell.org/blog/2023/06/surrogate-pair.html).
That is, there are special code points in the "surrogate range" that can be
paired to represent larger numbers.

### j8-escape
See the [Surrogate Pair Blog
Post](https://www.oilshell.org/blog/2023/06/surrogate-pair.html) for an
example:

- `\yff`
- `\u{03bc} \u{123456}`
\ud83e\udd26
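
The pairing arithmetic can be sketched in Python as follows. The function name `to_surrogate_pair` is hypothetical; the arithmetic itself is the standard UTF-16 rule, and the round trip uses Python's `json` module:

```python
import json

def to_surrogate_pair(code_point: int) -> str:
    """Return the JSON escape pair for a code point in U+10000..U+10FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a pair")
    offset = code_point - 0x10000    # 20 bits remain
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return "\\u%04x\\u%04x" % (high, low)

pair = to_surrogate_pair(0x1F926)
print(pair)

# A JSON decoder joins the two halves back into one code point.
decoded = json.loads('"' + pair + '"')
print(hex(ord(decoded)))
```

For `U+1F926` this yields the same pair as the example above.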

<h3 id="b-prefix">b-prefix <code>b""</code></h3>
<h3 id="u-prefix">u-prefix <code>u'hi'</code></h3>

Used to express byte strings.
A type of J8 string.

- May contain `\yff` escapes, e.g. `b"byte \yff"`
- May **not** contain `\u1234` escapes. Must be `\u{1234}` or `\u{123456}`
u'hi μ \n'

It's never necessary to **emit**, but it can be used to express that a string
is **valid Unicode**. JSON strings can represent strings that aren't Unicode
because they may contain surrogate halves.

<h3 id="j-prefix">j-prefix <code>j""</code></h3>
In contrast, `u''` strings can only have escapes like `\u{1f926}`, with no
surrogate pairs or halves.
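
A short Python sketch of why a JSON string is not necessarily valid Unicode. It relies on the behavior of Python's `json` module, which accepts a lone surrogate half; other decoders may behave differently. The helper `is_valid_unicode` is a name assumed for illustration:

```python
import json

def is_valid_unicode(s: str) -> bool:
    """True if s can be encoded as UTF-8, i.e. it has no lone surrogates."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

# Only the first half of a surrogate pair: accepted as JSON by Python,
# but the decoded string has no UTF-8 encoding.
half = json.loads(r'"\ud83e"')
print(is_valid_unicode(half))

# The complete pair decodes to a single, valid code point.
whole = json.loads(r'"\ud83e\udd26"')
print(is_valid_unicode(whole))
```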

Used to express that a string is valid Unicode. (JSON strings aren't
necessarily valid Unicode: they may contain surrogate halves.)
- The **encoded** bytes must also be valid UTF-8, like JSON strings.
- The decoded bytes are valid UTF-8, **unlike** JSON strings.

- No `\yff` escapes
- May **not** contain `\u1234` escapes, must be `\u{1234}` or `\u{123456}`
Escaping:

- `u''` strings may **not** contain `\u1234` escapes. Use the braced form
instead, such as `\u{1234}` or `\u{1f926}`.
- They may not contain `\yff` escapes, because those would represent a string
that's not UTF-8 or Unicode.
- Surrogate pairs are never necessary in `u''` or `b''` strings. Use the
longer form `\u{1f926}`.
- You can always emit literal UTF-8, so `\u{1f926}` escapes aren't strictly
necessary. Decoders must accept these escapes.
- A literal single quote is escaped with `\'`
- Decoders still accept `\"`, but encoders don't emit it.
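
The escaping rules above can be sketched as a tiny encoder. This is an illustrative sketch, not the Oils implementation: the name `encode_u_string` is hypothetical, and the choice of which characters to escape (quote, backslash, newline, other control characters) is an assumption.

```python
def encode_u_string(s: str) -> str:
    """Hypothetical helper: render a Python str as a J8 u'' literal."""
    # u'' strings must be valid Unicode, so reject surrogate halves up front.
    s.encode("utf-8")  # raises UnicodeEncodeError on a lone surrogate

    parts = []
    for ch in s:
        if ch == "'":
            parts.append("\\'")       # literal single quote
        elif ch == "\\":
            parts.append("\\\\")
        elif ch == "\n":
            parts.append("\\n")
        elif ord(ch) < 0x20:
            # Assumption: remaining control characters use the braced form.
            parts.append("\\u{%04x}" % ord(ch))
        else:
            parts.append(ch)          # literal UTF-8, including double quotes
    return "u'" + "".join(parts) + "'"

print(encode_u_string("isn't"))
```

Double quotes and non-ASCII characters pass through unescaped, which matches the rules that encoders don't emit `\"` and that literal UTF-8 is always allowed.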

<h3 id="b-prefix">b-prefix <code>b'hi'</code></h3>

Another J8 string. These `b''` strings are identical to `u''` strings, but
they can also contain `\yff` escapes.

Examples:

b'hi μ \n'
b'this isn\'t a valid unicode string \yff\yfe \u{3bc}'
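
One way to produce such a literal from arbitrary bytes is sketched below. The name `encode_b_string` is hypothetical and this is not the Oils encoder; it leans on Python's `surrogateescape` error handler to keep valid UTF-8 readable while marking each undecodable byte. Emitting control characters as `\yXX` is an assumption.

```python
def encode_b_string(data: bytes) -> str:
    """Hypothetical helper: render bytes as a J8 b'' literal."""
    # surrogateescape maps each undecodable byte 0xNN to code point U+DCNN.
    text = data.decode("utf-8", errors="surrogateescape")

    parts = []
    for ch in text:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            parts.append("\\y%02x" % (code - 0xDC00))  # byte that isn't UTF-8
        elif ch == "'":
            parts.append("\\'")
        elif ch == "\\":
            parts.append("\\\\")
        elif ch == "\n":
            parts.append("\\n")
        elif code < 0x20:
            parts.append("\\y%02x" % code)  # assumption: control bytes as \yXX
        else:
            parts.append(ch)                # valid UTF-8 stays literal
    return "b'" + "".join(parts) + "'"

print(encode_b_string(b"byte \xff"))
```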

<h3 id="j8-escape">j8-escape <code>\u{1f926} \yff</code></h3>

To summarize, the valid J8 escapes are:

\'
\yff # only valid in b'' strings
\u{3bc} \u{1f926} etc.

## JSON8

These are simply [JSON][] strings with the two J8 Escapes, and the
optional J prefix.
JSON8 is JSON with 4 rules:

- J8 strings in addition to JSON strings
- unquoted keys
- trailing commas
- comments?

### Null

9 changes: 5 additions & 4 deletions doc/ref/toc-data.md
@@ -19,10 +19,11 @@ Siblings: [OSH Topics](toc-osh.html), [YSH Topics](toc-ysh.html)
(<a class="group-link" href="chap-j8.html">j8</a>)
</h2>

```chapter-links-data-lang
[J8 Strings] json-escape \n surrogate-pair
j8-escape \yff \u{03bc}
b-prefix b"" j-prefix j""
```chapter-links-j8
[J8 Strings] json-string "hi" json-escape \" \\ \u1234
surrogate-pair \ud83e\udd26
u-prefix u'hi' b-prefix b'hi'
j8-escape \u{1f926} \yff
[JSON8] Null Bool Number
Json8String
List Dict