XXX Finish documenting literal syntax.
jasone committed Sep 19, 2020
1 parent 4ade2c9 commit 5cc939d
Showing 2 changed files with 142 additions and 156 deletions.
145 changes: 134 additions & 11 deletions doc/design/syntax.md
@@ -111,7 +111,19 @@ Comments have two syntaxes.
Operators are either prefix or infix, as determined by the leading codepoint.

- Prefix operator: `[~?][-+*/%@<=>|:.~?]+`
- Infix operator: `[-+*/%@<=>|][-+*/%@<=>|:.~?]*`
- Infix operator:
+ `|[-+*/%@<=>|:.~?]+`
+ `[-+*/%@<=>][-+*/%@<=>|:.~?]*`
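
The updated patterns can be exercised with a small Python sketch. The function name `classify_operator` is hypothetical and this is not the Hemlock scanner; it only applies the prefix pattern and the two-part infix pattern exactly as written above.

```python
import re

# Character class shared by every operator tail, copied from the patterns above.
TAIL = r"[-+*/%@<=>|:.~?]"

# Prefix operators start with `~` or `?` and need at least one tail codepoint.
PREFIX_RE = re.compile(rf"[~?]{TAIL}+")
# Infix operators either start with `|` plus at least one tail codepoint,
# or start with any other infix lead codepoint plus an optional tail.
INFIX_RE = re.compile(rf"\|{TAIL}+|[-+*/%@<=>]{TAIL}*")


def classify_operator(token: str):
    """Return "prefix", "infix", or None for a candidate operator token."""
    if PREFIX_RE.fullmatch(token):
        return "prefix"
    if INFIX_RE.fullmatch(token):
        return "infix"
    return None
```

Under these patterns a lone `|` is not an operator, which is consistent with `|` appearing in the punctuation list.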

### Punctuation

In addition to operators, Hemlock uses various symbols as punctuation:

```hemlock
. , ; : ::
( ) [ ] | [| |]
! $ ^ & /& \&
```

### Keyword

@@ -141,11 +153,119 @@ The first codepoint of an identifier helps distinguish namespaces.

### Integer

XXX
Integers are either signed or unsigned, though a leading sign is always a
separate token. Integer literals may be specified in any of four bases, as
determined by an optional base prefix:

- `0b`: Binary, where digits are in `[01]`.
- `0o`: Octal, where digits are in `[0-7]`.
- Default: Decimal, where digits are in `[0-9]`.
- `0x`: Hexadecimal, where digits are in `[0-9a-f]`.

`_` codepoints may be arbitrarily placed before/after any digit as desired to
make literals easier to read.

Integers are unsigned 64-bit by default, but an optional type suffix can select
a different signedness and bitwidth:

- Unsigned:
+ `u8`: Unsigned 8-bit
+ `u16`: Unsigned 16-bit
+ `u32`: Unsigned 32-bit
+ `u64`/`u`: Unsigned 64-bit (`uns` type)
+ `u128`: Unsigned 128-bit
+ `u256`: Unsigned 256-bit
+ `u512`: Unsigned 512-bit
- Signed:
+ `i8`: Signed 8-bit
+ `i16`: Signed 16-bit
+ `i32`: Signed 32-bit
+ `i64`/`i`: Signed 64-bit (`int` type)
+ `i128`: Signed 128-bit
+ `i256`: Signed 256-bit
+ `i512`: Signed 512-bit

Examples:

- `uns`
```hemlock
0
42
15u
17u64
0x0123_4567_89ab_cdef
0o660
0b10_0001
0b0100_0001
1_000_000
0x___1_fffd
```
- `int`
```hemlock
0i
42i
17i64
0x_ab__c_i
0o777i
```
- `byte`
```hemlock
0u8
0xffu8
```
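The rules above (base prefix, `_` separators, type suffix) can be sketched as a small Python recognizer. This is an illustration under assumptions, not the Hemlock scanner: `parse_integer` is a hypothetical name, a decimal literal is assumed to start with a digit, at least one digit is assumed to be required, and no range checking is attempted.

```python
import re

# Valid digits and numeric base for each prefix ("" means decimal).
DIGITS = {"0b": "01", "0o": "01234567", "": "0123456789", "0x": "0123456789abcdef"}
BASES = {"0b": 2, "0o": 8, "": 10, "0x": 16}

# Optional base prefix, digits with `_` separators, optional type suffix.
INT_RE = re.compile(r"(0b|0o|0x)?([0-9a-f_]+)(?:([ui])(8|16|32|64|128|256|512)?)?")


def parse_integer(literal: str):
    """Return (type suffix, value), e.g. ("u64", 42), or raise ValueError."""
    m = INT_RE.fullmatch(literal)
    if m is None:
        raise ValueError(f"not an integer literal: {literal!r}")
    prefix = m.group(1) or ""
    body = m.group(2)
    # Assumption: without a base prefix, the literal must start with a digit.
    if not prefix and body[0] == "_":
        raise ValueError(f"not an integer literal: {literal!r}")
    digits = body.replace("_", "")
    if not digits or any(d not in DIGITS[prefix] for d in digits):
        raise ValueError(f"invalid digits for base: {literal!r}")
    signedness = m.group(3) or "u"  # Unsigned by default.
    bits = m.group(4) or "64"       # 64-bit by default, also for bare `u`/`i`.
    return signedness + bits, int(digits, BASES[prefix])
```

For example, `parse_integer("0x_ab__c_i")` yields `("i64", 0xabc)`, matching the `int` example above, while `parse_integer("0b102")` raises `ValueError` because `2` is not a binary digit.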

### Real

XXX
Real numbers use the [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) binary
floating point format. Real number literals are distinct from integer literals
in at least one of the following ways:

- Decimal point
- Exponent
- Type suffix

Reals are signed, though a leading sign is always a separate token. Real number
literal mantissas may be specified in either of two bases, as determined by an
optional base prefix:

- Default: Decimal, where digits are in `[0-9]`.
- `0x`: Hexadecimal, where digits are in `[0-9a-f]`. All hexadecimal literals
have bit-precise machine representations, which is not universally true of
decimal literals.

Optional signed exponents are separated from the mantissa by a `p` codepoint,
and are always expressed as decimal values, where digits are in `[0-9]`. The
optional exponent sign is in `[-+]`.

`_` codepoints may be arbitrarily placed before/after any mantissa digit or
exponent sign/digit as desired to make literals easier to read.

Reals are 64-bit by default, but may be of different bitwidth, depending on an
optional type suffix:

- `r32`: 32-bit ["single-precision
float"](https://en.wikipedia.org/wiki/Single-precision_floating-point_format)
- `r64`/`r`: 64-bit ["double-precision
float"](https://en.wikipedia.org/wiki/Double-precision_floating-point_format)
(`real` type)

Examples:

```hemlock
0.
0p0
0r
0r64
1.0
1_000_000.0
42.p44
42.3p-78
0x4.a3
0x4a.3d2p+42
1.5r32
0x4a.3d2p+42_r32
1.234_567_p_+89_r32
```
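As a rough illustration of how these pieces fit together, the following Python sketch splits a real literal into base, mantissa digits, exponent, and bitwidth. The name `parse_real` is hypothetical and the grammar is an assumption pieced together from the prose and examples above (for instance, that at least one digit precedes the decimal point). It deliberately does not compute a numeric value, since the text above does not state the radix the exponent applies to.

```python
import re

# Optional `0x`, whole digits, optional point and fraction, optional `p`
# exponent, optional `r`/`r32`/`r64` suffix. `_` may appear among the digits.
REAL_RE = re.compile(
    r"(?P<prefix>0x)?"
    r"(?P<whole>[0-9a-f_]*)"
    r"(?P<point>\.)?(?P<frac>[0-9a-f_]*)"
    r"(?:p(?P<exp>_*[-+]?[0-9_]*))?"
    r"(?P<suffix>r(?:32|64)?)?"
)


def parse_real(literal: str):
    """Return (base, whole, frac, exponent, bits), or None if not a real."""
    m = REAL_RE.fullmatch(literal)
    if m is None:
        return None
    digits = "0123456789abcdef" if m["prefix"] else "0123456789"
    whole = m["whole"].replace("_", "")
    frac = m["frac"].replace("_", "")
    # Assumption: at least one digit before the point, all valid for the base.
    if not whole or any(d not in digits for d in whole + frac):
        return None
    # Assumption: a decimal literal cannot begin with `_`.
    if not m["prefix"] and m["whole"][0] == "_":
        return None
    exponent = None
    if m["exp"] is not None:
        exp_text = m["exp"].replace("_", "")
        if exp_text.lstrip("+-") == "":
            return None  # `p` must be followed by at least one digit.
        exponent = int(exp_text)
    # A real needs a decimal point, an exponent, or a type suffix.
    if m["point"] is None and exponent is None and m["suffix"] is None:
        return None
    bits = 32 if m["suffix"] == "r32" else 64
    return (16 if m["prefix"] else 10, whole, frac, exponent, bits)
```

With this sketch, `parse_real("42")` returns `None` because nothing distinguishes it from an integer, whereas `parse_real("0r")` is accepted on the strength of its type suffix alone.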

### Codepoint

Expand All @@ -159,6 +279,8 @@ interpolated for a limited set of codepoint sequences.
- `\'`: Single quote.
- `\\`: Backslash.

Examples:

```hemlock
'A'
'\u{10197}' # '𐆗'
@@ -243,15 +365,16 @@ Token path/line/column locations are ordinarily a simple function of the source
stream from which they derive, but if the source stream is generated from
another source, e.g. using a parser generator, it can be useful to associate
tokens with the pre-generated source locations. Line directives provide a
mechanism for setting the line and/or path for subsequent source lines. Line
directives must start with a `:` codepoint at column 0, followed by a decimal
line number and/or a double-quoted string path. The line directive tokens may
optionally have interspersed/trailing whitespace and/or a trailing hash comment.
mechanism for setting the line and path for subsequent source lines. Line
directives are consumed by the scanner and no tokens result unless there is a
syntax error in the line directive. As such, the line directive syntax is quite
rigid, to the point that even comments are disallowed. Line directives begin
with `:line` at column 0, followed by a space-separated unsigned line number and
an optional double-quoted string path.

Examples:

```
:42
:"foo.hl" # Omitted line number is implicitly 1.
:42 "foo.hl"
```hemlock
:line 42
:line 42 "foo.hl"
```
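The rigidity of the new `:line` form makes it easy to recognize with a single anchored pattern. The Python sketch below is only an illustration: `parse_line_directive` is a hypothetical name, the line number is assumed to be plain decimal digits, exactly one space is assumed between fields, and the path is assumed to contain no escapes or embedded quotes. It returns `None` for anything else, including the older `:42` form and trailing comments.

```python
import re

# `:line` at column 0, one space, a decimal line number, and an optional
# space-separated double-quoted path. Nothing else may follow.
LINE_DIRECTIVE_RE = re.compile(r':line ([0-9]+)(?: "([^"\\]*)")?')


def parse_line_directive(source_line: str):
    """Return (line number, path or None), or None if not a directive."""
    m = LINE_DIRECTIVE_RE.fullmatch(source_line)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)
```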
153 changes: 8 additions & 145 deletions doc/design/types.md
@@ -86,44 +86,8 @@ The following integer types are provided:
`uns` is the default integer type, and should be preferred over other integer
types unless the use case demands a specific signedness or bit width. Hemlock
fundamentally depends on at least a 64-bit architecture, so `uns`/`int` always
provide at least a 64-bit range.

Integer literals start with an optional sign if signed, followed by the
magnitude, and finally a type suffix (optional for `uns`). Non-zero signed
integers have an optional `+`/`-` sign prefix; unsigned integers never have an
explicit sign. Magnitudes can be specified in base 10, 16 (`0x` prefix), 8 (`0o`
prefix), or 2 (`0b` prefix). The type suffix matches the pattern,
`[ui](8|16|32|64|128|256|512)?`. `_` separators can be arbitrarily internally
interspersed, so long as they come after the base prefix (if present) and before
the type suffix (if present). Out-of-range literals cause compile-time failure.

`uns` literal examples:

```hemlock
0
42
15u
0x0123_4567_89ab_cdef
0o660
0b10_0001
```

`int` literal examples:

```hemlock
0i
+42i
-14i
0x_ab__c_i
0o777
```

`byte` literal examples:

```hemlock
0u8
0xffu8
```
provide at least a 64-bit range. See [integer literal syntax](syntax.md#integer)
for further details.

### Real

@@ -138,57 +102,8 @@ Hemlock because they are not commonly supported in hardware, making them of
limited practical use.

Real number literals are always syntactically distinct from integers, due to a
decimal point or type suffix if nothing else. The default `real` bitwidth is 64;
an `r64` type suffix allows the bitwidth to be explicit, and an `r32` suffix
allows specification of 32-bit `r32` constants. Values can be specified in
decimal or hexadecimal; the latter allows bit-precise specification of
constants, whereas many decimal real numbers have no exact binary
representation.

```hemlock
0.
1.0
0p64
1.5r32
4.2
+4.2
42.
42.p44
1p44
3.2p42
1_000_000.0
16p+_42
16.000_342p-42
-inf
inf
+inf
nan
nan_r32
-0.0
0.0
+0.0
4p3
42.3p-78_r64
42.3p-78
42.3p+78
42.3p78
-0x1.42
-0x0.0
+0x0.0
0x0.0
0x4.a3
+0x4.a3
0x4ap42
0x4ap-42
0x4ap+42
0x4a.3d2p+42
0x4a.3d2p+42_r32
-0x1.ffffffc
0x1.ffff_ffc
0x1.ffffffcp-1022
0x1.ffffffcp1023
0x1.ffffffcp+1023
```
decimal point or type suffix if nothing else. See [real literal
syntax](syntax.md#real) for further details.

### Codepoint

@@ -201,67 +116,15 @@ type codepoint = private u32
```

However, `codepoint` is intentionally type-incompatible with `u32`, thus
requiring explicit validating conversion.

Hemlock's source code encoding is always
[UTF-8](https://en.wikipedia.org/wiki/UTF-8), so the simplest way to specify a
`codepoint` literal is as UTF-8 inside `'` delimiters, e.g.:

```hemlock
'<'
'«'
'‡'
'𐆗'
```

Alternately, codepoints can be specified in hexadecimal:

```hemlock
'\u{3c}' # '<'
'\u{ab}' # '«'
'\u{2021}' # '‡'
'\u{10197}' # '𐆗'
```

Various special characters can be specified via `\` escapes:

```hemlock
'\t' # Tab
'\n' # Newline
'\r' # Carriage return
'\'' # Single quote
'\\' # Backslash
```
requiring explicit validating conversion. See [codepoint literal
syntax](syntax.md#codepoint) for further details.
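
The codepoint literal forms this section describes (a single UTF-8 codepoint, a `\u{...}` hexadecimal escape, or one of the `\t`, `\n`, `\r`, `\'`, `\\` escapes) can be sketched as a small Python decoder. The name `decode_codepoint_literal` is hypothetical, lowercase hexadecimal with one to six digits is an assumption, and the sketch does not perform the full validation a real scanner would need.

```python
import re

# Single-character escapes listed for codepoint literals.
SIMPLE_ESCAPES = {"t": "\t", "n": "\n", "r": "\r", "'": "'", "\\": "\\"}
# Either `\u{hex}` or a backslash followed by exactly one character.
ESCAPE_RE = re.compile(r"\\(?:u\{([0-9a-f]{1,6})\}|(.))", re.DOTALL)


def decode_codepoint_literal(literal: str) -> str:
    """Decode a `'...'` codepoint literal to a one-character Python string."""
    if len(literal) < 3 or literal[0] != "'" or literal[-1] != "'":
        raise ValueError(f"not a codepoint literal: {literal!r}")
    body = literal[1:-1]
    if not body.startswith("\\"):
        # Unescaped form: exactly one codepoint, and not a bare quote.
        if len(body) != 1 or body == "'":
            raise ValueError(f"not a codepoint literal: {literal!r}")
        return body
    m = ESCAPE_RE.fullmatch(body)
    if m is None:
        raise ValueError(f"malformed escape: {literal!r}")
    if m.group(1) is not None:
        return chr(int(m.group(1), 16))  # Raises ValueError above 0x10ffff.
    if m.group(2) not in SIMPLE_ESCAPES:
        raise ValueError(f"unknown escape: {literal!r}")
    return SIMPLE_ESCAPES[m.group(2)]
```

For instance, `decode_codepoint_literal(r"'\u{3c}'")` returns `"<"`, the same codepoint as the unescaped literal `'<'`.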

### String

The `string` type contains a UTF-8-encoded sequence of `codepoint` values. It is
impossible to construct a string with invalid UTF-8 encoding, whether via string
literals or programmatically at run time. Similarly to `codepoint` literals, the
simplest way to specify a `string` literal is as UTF-8, inside `"` delimiters,
e.g.

```hemlock
"Hello"
"A non-ASCII string -- <«‡𐆗"
```

Codepoints can also be specified in delimited hexadecimal, e.g.:

```hemlock
"A non-ASCII string -- \u{3c}\u{ab}\u{2021}\u{10197}"
```

Various special characters can be specified via `\` escapes:

```hemlock
"\t" # Tab
"\n" # Newline
"\r" # Carriage return
"\"" # Double quote
"\\" # Backslash
"\␤" # Newline omitted
```
literals or programmatically at run time. See [string literal
syntax](syntax.md#string) for further details.

## Collection types
