XXX Document various tokens' syntax.
jasone committed Sep 14, 2020
1 parent 0a42740 commit d231925
Showing 2 changed files with 142 additions and 1 deletion.
142 changes: 141 additions & 1 deletion doc/design/syntax.md
@@ -7,6 +7,14 @@ OCaml syntax such as that for objects and polymorphic variants, and adds syntax
for distinct features like algebraic effects. But the biggest difference is that
Hemlock uses indentation to rigidly define block structure.

## Encoding

Hemlock uses [UTF-8](https://en.wikipedia.org/wiki/UTF-8) encoding. A [byte
order mark](https://en.wikipedia.org/wiki/Byte_order_mark) prefix will cause a
syntax error and should therefore always be omitted. Encoding errors result in
one or more `` replacement codepoint substitutions, which will cause syntax
errors outside of comments and codepoint/string literals.

## Indentation

Semantically meaningful indentation has multiple advantages:
@@ -49,7 +57,7 @@ interesting takes on the problem.
Hemlock takes a comparatively simple approach to avoid some of the pitfalls
mentioned above:

- Tabs are forbidden in whitespace. (Tabs can be embedded only in raw string
- Tabs are forbidden in whitespace. (Tabs can be embedded only in string
literals and comments.)
- The first non-whitespace token establishes line indentation.
- Block indentation is four columns.
@@ -85,4 +93,136 @@ Automated formatting only needs to perform a few actions:

## Tokens

### Whitespace

Whitespace comprises space (`' '`) and newline (`'\n'`) codepoints. Tabs,
carriage returns, form feeds, etc. are not treated as whitespace.

### Comment

Comments have two syntaxes.
- Single-line `#...` comments are delimited by a leading `#` and a trailing
`\n`.
- `(*...*)` comments are delimited by symmetric `(*` and `*)` sequences, which
allows `(* comments (* to be *) nested *)`.
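
The nesting rule can be sketched with a depth counter. The Python below is a minimal illustration, not Hemlock's scanner; the function names are hypothetical, and treating end of input as the end of a `#` comment is an assumption that the text above does not state.

```python
def skip_line_comment(src: str, start: int) -> int:
    """Return the index just past the newline ending a `#...` comment.

    Assumption: a comment that reaches end of input ends there.
    """
    end = src.find("\n", start)
    return len(src) if end == -1 else end + 1


def skip_block_comment(src: str, start: int) -> int:
    """Return the index just past the `*)` closing the comment at `start`.

    Each nested `(*` raises the depth and each `*)` lowers it, so the
    comment ends only when the depth returns to zero.
    """
    if not src.startswith("(*", start):
        raise ValueError("no block comment opens at this position")
    depth = 0
    i = start
    while i < len(src):
        if src.startswith("(*", i):
            depth += 1
            i += 2
        elif src.startswith("*)", i):
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise SyntaxError("unterminated comment")
```

A plain search for the first `*)` would end the comment after `to be *)`, which is why the counter is needed.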

### Operator

XXX

### Keyword

The following words are keywords. They serve as syntactic elements and cannot
be used for any other purpose.

```
and         external    include     struct
also        false       lazy        then
as          for         let         to
assert      fun         match       true
constraint  function    module      type
demand      functor     of          val
do          halt        or          when
downto      if          rec         while
else        in          sig         with
```

### Identifier

The first codepoint of an identifier helps distinguish namespaces.

- Capitalized identifiers match `[A-Z][A-Za-z0-9_']*`, and are used for module
names.
- Uncapitalized identifiers match `[a-z_][A-Za-z0-9_']*`, and are used for
value names, parameter label names, type names, and record field names.
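
To make the interaction between keywords and the two identifier patterns concrete, here is a small Python sketch that classifies a single word. The regular expressions and keyword list are copied from this document; the `classify` function and its return labels are hypothetical, and checking keywords before identifiers is an assumption based on the statement that keywords cannot be used for other purposes.

```python
import re

KEYWORDS = frozenset("""
    and external include struct
    also false lazy then
    as for let to
    assert fun match true
    constraint function module type
    demand functor of val
    do halt or when
    downto if rec while
    else in sig with
""".split())

CAPITALIZED = re.compile(r"[A-Z][A-Za-z0-9_']*")
UNCAPITALIZED = re.compile(r"[a-z_][A-Za-z0-9_']*")


def classify(word: str) -> str:
    """Classify a word as a keyword, one identifier kind, or invalid."""
    if word in KEYWORDS:
        return "keyword"
    if CAPITALIZED.fullmatch(word):
        return "capitalized"  # module names
    if UNCAPITALIZED.fullmatch(word):
        return "uncapitalized"  # value, label, type, and field names
    return "invalid"
```

For example, `classify("List")` yields `"capitalized"`, `classify("x'")` yields `"uncapitalized"`, and `classify("match")` yields `"keyword"` even though it also matches the uncapitalized pattern.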

### Integer

XXX

### Float

XXX

### Codepoint

Codepoint tokens are delimited by `'` codepoints, and their contents are
interpolated for a limited set of codepoint sequences.
- `\u{...}`: Hexadecimal-encoded codepoint, e.g. `\u{fffd}` is the ``
replacement codepoint.
- `\t`: Tab (`\u{9}`).
- `\n`: Newline, aka line feed (`\u{a}`).
- `\r`: Return (`\u{d}`).
- `\'`: Single quote.
- `\\`: Backslash.

```
'A'
'\u{10197}' # '𐆗'
'\t'
'\r'
'\n'
'\''
'\\'
```
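
The escape table above maps directly onto a small decoder. The following Python sketch turns one codepoint token into the codepoint it denotes; `decode_codepoint` is a hypothetical name, accepting uppercase hexadecimal digits is an assumption, and rejecting an unescaped `'` or `\` as the sole content is inferred from the escape list rather than stated in the text.

```python
import re

SIMPLE_ESCAPES = {"t": "\t", "n": "\n", "r": "\r", "'": "'", "\\": "\\"}
UNICODE_ESCAPE = re.compile(r"\\u\{([0-9a-fA-F]+)\}")


def decode_codepoint(token: str) -> str:
    """Return the single codepoint denoted by a '...' token."""
    if len(token) < 3 or token[0] != "'" or token[-1] != "'":
        raise SyntaxError("codepoint tokens are delimited by '")
    body = token[1:-1]
    match = UNICODE_ESCAPE.fullmatch(body)
    if match:
        # chr raises ValueError for values beyond U+10FFFF.
        return chr(int(match.group(1), 16))
    if len(body) == 2 and body[0] == "\\" and body[1] in SIMPLE_ESCAPES:
        return SIMPLE_ESCAPES[body[1]]
    if len(body) == 1 and body not in "\\'":
        return body
    raise SyntaxError("invalid codepoint token: " + token)
```

Each example in the block above decodes to exactly one codepoint under this sketch, e.g. `decode_codepoint(r"'\u{10197}'")` returns `'𐆗'`.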

### String

String tokens have three distinct syntaxes, all of which are useful depending on
contents and context:

- **Interpolated** strings are delimited by `"` codepoints, and their contents
are interpolated for a limited set of codepoint sequences.
```
"Interpolated string without any interpolated sequences"
```
The following codepoint sequences are interpolated as indicated:
+ `\u{...}`: Hexadecimal-encoded codepoint, e.g. `\u{fffd}` is the ``
replacement codepoint.
+ `\t`: Tab (`\u{9}`).
+ `\n`: Newline, aka line feed (`\u{a}`).
+ `\r`: Return (`\u{d}`).
+ `\"`: Double quote.
+ `\\`: Backslash.
+ `\␤`: Newline omitted. This allows a string literal to be broken across
lines without the line breaks becoming part of the string.
- **Raw** strings are delimited by matching `` `[a-z_]*` `` sequences, where the
optional tag between the `` ` `` codepoints can be used to distinguish the
delimiters from string contents.
```
``Simple raw string``
`_`String that would ``end prematurely`` without a tag`_`
```
If the raw string begins and/or ends with a `\n`, that codepoint is omitted.
This allows raw string delimiters to be on separate lines from the string
contents without changing the string.
```
``
Single-line raw string
``
# ...
``
Three-line raw string
``
```
- **Bar-margin [raw] strings** are delimited by `` `| `` and a codepoint
sequence matching `` ^[ ]*` ``. Each line past the first one is prefixed by
enough whitespace to align a `|` with the opening delimiter's `|`. The per
line leading whitespace and `|` are omitted from the string; they provide a
left margin for the string contents.
```
`|First line
|Second line
`
```
Note that the final `\n` preceding the closing delimiter is itself part of the
delimiter and is omitted.
```
`|Single-line bar-margin string
`
`|Two-line bar-margin string
|
`
```
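
The delimiter rules for raw and bar-margin strings can be sketched as two small decoders that take an already-scanned token and return its contents. This Python is illustrative only: the function names and the `column` parameter (the column of the opening backtick) are assumptions, and finding where a token ends within a larger source file is deliberately left out.

```python
import re

RAW_OPEN = re.compile(r"`([a-z_]*)`")
BAR_CLOSE = re.compile(r"\n[ ]*`\Z")


def decode_raw(token: str) -> str:
    """Strip matching tagged delimiters and one newline at each end."""
    opening = RAW_OPEN.match(token)
    if opening is None:
        raise SyntaxError("not a raw string")
    delim = opening.group(0)
    if len(token) < 2 * len(delim) or not token.endswith(delim):
        raise SyntaxError("closing delimiter does not match " + delim)
    body = token[len(delim):len(token) - len(delim)]
    # A newline adjacent to either delimiter is not part of the string.
    if body.startswith("\n"):
        body = body[1:]
    if body.endswith("\n"):
        body = body[:-1]
    return body


def decode_bar_margin(token: str, column: int = 0) -> str:
    """Strip the `| opener, the per-line margin, and the closing delimiter."""
    closing = BAR_CLOSE.search(token)
    if not token.startswith("`|") or closing is None:
        raise SyntaxError("not a bar-margin string")
    first, *rest = token[2:closing.start()].split("\n")
    # Continuation lines must align their | under the opening delimiter's |.
    margin = " " * (column + 1) + "|"
    lines = [first]
    for line in rest:
        if not line.startswith(margin):
            raise SyntaxError("misaligned bar margin")
        lines.append(line[len(margin):])
    return "\n".join(lines)
```

Under these assumptions, the two-line bar-margin example decodes to `"Two-line bar-margin string\n"`: the empty second line contributes the trailing newline, while the newline before the closing backtick is consumed as part of the delimiter.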
1 change: 1 addition & 0 deletions doc/design/types.md
@@ -238,6 +238,7 @@ Various special characters can be specified via `\` escapes:
"\r" # Carriage return
"\"" # Double quote
"\\" # Backslash
"\␤" # Newline omitted

## Collection types
