From d2319255d5f5e0f3f61ab5f7e8d13f4f788daa69 Mon Sep 17 00:00:00 2001
From: Jason Evans <jasone@canonware.com>
Date: Sun, 13 Sep 2020 23:22:07 -0700
Subject: [PATCH] XXX Document various tokens' syntax.

---
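Note on the Keyword and Identifier sections: the rules they document can be
exercised with a minimal Python sketch. This is an illustration only, not the
Hemlock scanner. The `classify` helper and its return labels are hypothetical,
and checking keywords before the identifier patterns is an assumption based on
the statement that keywords cannot be used for other purposes.

```python
import re

# Keyword list copied from the "Keyword" section added by this patch.
KEYWORDS = frozenset("""
and also as assert constraint demand do downto else
external false for fun function functor halt if in
include lazy let match module of or rec sig
struct then to true type val when while with
""".split())

# Identifier patterns copied from the "Identifier" section.
CAPITALIZED = re.compile(r"[A-Z][A-Za-z0-9_']*")
UNCAPITALIZED = re.compile(r"[a-z_][A-Za-z0-9_']*")


def classify(word):
    """Hypothetical helper: classify a single whitespace-free word."""
    if word in KEYWORDS:  # assumption: keywords take precedence
        return "keyword"
    if CAPITALIZED.fullmatch(word):
        return "capitalized"
    if UNCAPITALIZED.fullmatch(word):
        return "uncapitalized"
    return "other"
```
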
 doc/design/syntax.md | 142 ++++++++++++++++++++++++++++++++++++++++++-
 doc/design/types.md  |   1 +
 2 files changed, 142 insertions(+), 1 deletion(-)
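
Note on bar-margin strings: the margin-stripping rule can be sketched in Python
as below. This is a rough illustration under stated assumptions, not the
Hemlock implementation. `decode_bar_margin` is a hypothetical name, the input is
assumed to be the token text from the opening delimiter through the closing
backtick, and the sketch does not verify that each `|` is column-aligned with
the opening delimiter's `|`.

```python
import re


def decode_bar_margin(token):
    """Hypothetical helper: return the contents of a bar-margin string token."""
    if not token.startswith("`|"):
        raise ValueError("missing opening delimiter")
    # The closing delimiter is a newline, optional spaces, then a backtick.
    # The newline is part of the delimiter, so it is not part of the contents.
    closing = re.search(r"\n *`\Z", token)
    if closing is None:
        raise ValueError("missing closing delimiter")
    lines = token[2:closing.start()].split("\n")
    contents = [lines[0]]
    for line in lines[1:]:
        stripped = line.lstrip(" ")
        if not stripped.startswith("|"):
            raise ValueError("continuation line lacks a '|' margin")
        # Drop the leading spaces and the '|' margin marker.
        contents.append(stripped[1:])
    return "\n".join(contents)
```

Applied to the three examples in the patch, the sketch yields a two-line
string, a single-line string with no trailing newline, and a string whose
second line is empty.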

diff --git a/doc/design/syntax.md b/doc/design/syntax.md
index 5aed8741e..46dcc3984 100644
--- a/doc/design/syntax.md
+++ b/doc/design/syntax.md
@@ -7,6 +7,14 @@ OCaml syntax such as that for objects and polymorphic variants, and adds syntax
 for distinct features like algebraic effects. But the biggest difference is that
 Hemlock uses indentation to rigidly define block structure.
 
+## Encoding
+
+Hemlock uses [UTF-8](https://en.wikipedia.org/wiki/UTF-8) encoding. A [byte
+order mark](https://en.wikipedia.org/wiki/Byte_order_mark) prefix will cause a
+syntax error and should therefore always be omitted. Encoding errors result in
+one or more `�` replacement codepoint substitutions, which will cause syntax
+errors outside of comments and codepoint/string literals.
+
 ## Indentation
 
 Semantically meaningful indentation has multiple advantages:
@@ -49,7 +57,7 @@ interesting takes on the problem.
 Hemlock takes a comparatively simple approach to avoid some of the pitfalls
 mentioned above:
 
-- Tabs are forbidden in whitespace. (Tabs can be embedded only in raw string
+- Tabs are forbidden in whitespace. (Tabs can be embedded only in string
   literals and comments.)
 - The first non-whitespace token establishes line indentation.
 - Block indentation is four columns.
@@ -85,4 +93,136 @@ Automated formatting only needs to perform a few actions:
 
 ## Tokens
 
+### Whitespace
+
+Whitespace comprises space (`' '`) and newline (`'\n'`) codepoints. Tabs,
+carriage returns, form feeds, etc. are not treated as whitespace.
+
+### Comment
+
+Comments have two syntaxes.
+- Single-line `#...` comments are delimited by a leading `#` and a trailing
+  `\n`.
+- `(*...*)` comments are delimited by symmetric `(*` and `*)` sequences, which
+  allows `(* comments (* to be *) nested *)`.
+
+### Operator
+
 XXX
+
+### Keyword
+
+The following words are keywords. They serve as syntactic elements and cannot
+be used for other purposes.
+
+```
+and         external  include  struct
+also        false     lazy     then
+as          for       let      to
+assert      fun       match    true
+constraint  function  module   type
+demand      functor   of       val
+do          halt      or       when
+downto      if        rec      while
+else        in        sig      with
+```
+
+### Identifier
+
+The first codepoint of an identifier helps distinguish namespaces.
+
+- Capitalized identifiers match `[A-Z][A-Za-z0-9_']*`, and are used for module
+  names.
+- Uncapitalized identifiers match `[a-z_][A-Za-z0-9_']*`, and are used for
+  value names, parameter label names, type names, and record field names.
+
+### Integer
+
+XXX
+
+### Float
+
+XXX
+
+### Codepoint
+
+Codepoint tokens are delimited by `'` codepoints, and their contents are
+interpolated for a limited set of codepoint sequences.
+- `\u{...}`: Hexadecimal-encoded codepoint, e.g. `\u{fffd}` is the `�`
+  replacement codepoint.
+- `\t`: Tab (`\u{9}`).
+- `\n`: Newline, aka line feed (`\u{a}`).
+- `\r`: Carriage return (`\u{d}`).
+- `\'`: Single quote.
+- `\\`: Backslash.
+
+```
+    'A'
+    '\u{10197}' # '𐆗'
+    '\t'
+    '\r'
+    '\n'
+    '\''
+    '\\'
+```
+
+### String
+
+String tokens have three distinct syntaxes, each of which is useful depending on
+contents and context:
+
+- **Interpolated** strings are delimited by `"` codepoints, and their contents
+  are interpolated for a limited set of codepoint sequences.
+  ```
+  "Interpolated string without any interpolated sequences"
+  ```
+  The following codepoint sequences are interpolated as indicated:
+  + `\u{...}`: Hexadecimal-encoded codepoint, e.g. `\u{fffd}` is the `�`
+    replacement codepoint.
+  + `\t`: Tab (`\u{9}`).
+  + `\n`: Newline, aka line feed (`\u{a}`).
+  + `\r`: Carriage return (`\u{d}`).
+  + `\"`: Double quote.
+  + `\\`: Backslash.
+  + `\␤`: Newline omitted. This allows a string literal to be broken across
+    lines without the line breaks becoming part of the string.
+- **Raw** strings are delimited by matching `` `[a-z_]*` `` sequences, where the
+  optional tag between the `` ` `` codepoints can be used to distinguish the
+  delimiters from string contents.
+  ```
+  ``Simple raw string``
+  `_`String that would ``end prematurely`` without a tag`_`
+  ```
+  If the raw string begins and/or ends with a `\n`, that codepoint is omitted.
+  This allows raw string delimiters to be on separate lines from the string
+  contents without changing the string.
+  ```
+  ``
+  Single-line raw string
+  ``
+  # ...
+  ``
+
+  Three-line raw string
+
+  ``
+  ```
+- **Bar-margin** raw strings are delimited by `` `| `` and a codepoint
+  sequence matching `` ^[ ]*` ``. Each line past the first one is prefixed by
+  enough whitespace to align a `|` with the opening delimiter's `|`. The
+  per-line leading whitespace and `|` are omitted from the string; they provide
+  a left margin for the string contents.
+  ```
+  `|First line
+   |Second line
+  `
+  ```
+  Note that the final `\n` preceding the closing delimiter is itself part of the
+  delimiter and is omitted.
+  ```
+  `|Single-line bar-margin string
+  `
+  `|Two-line bar-margin string
+   |
+  `
+  ```
diff --git a/doc/design/types.md b/doc/design/types.md
index 3bac814f3..3a0381ad0 100644
--- a/doc/design/types.md
+++ b/doc/design/types.md
@@ -238,6 +238,7 @@ Various special characters can be specified via `\` escapes:
     "\r" # Carriage return
     "\"" # Double quote
     "\\" # Backslash
+    "\␤" # Newline omitted
 
 ## Collection types