The Language

Martin Ender edited this page Jan 14, 2018 · 25 revisions

This revision documents Retina 0.8.2, the last release before Retina 1.0. See here for documentation of the latest version.

Retina is a regex-based recreational programming language. Every program reads a (finite) string from standard input and transforms it via one or more regex operations (e.g. matching, splitting and, most of all, substituting). Under the hood, it uses .NET's regex engine, which means that both the .NET flavour and the ECMAScript flavour are available.

Retina was mainly developed for code golf, which may explain its very terse configuration syntax and some weird design decisions.

Basics

Each program is grouped into stages. Each stage has a type and transforms its input (string) to an output string which is passed to the next stage. The first stage reads from the standard input stream. Each stage may optionally print its result to the standard output stream in addition to passing it on. By default, only the very last stage prints its result. Stages can also be grouped or be applied repeatedly in a loop.

Most stages consist of a single "part", which is the regex itself and potentially a configuration string. A few stages consist of two "parts", where the first is the regex and the second is a substitution pattern.

By default, Retina reads a single source file, where each of those "parts" is a separate line:

./Retina ./program.ret

If you actually want to use multiple lines for each part (e.g. to split a complicated regex over multiple lines in free-spacing mode), you can use the -m flag. In this case, each part is either given as a separate file, or directly on the command line, by using the -e flag. So all of the following are valid invocations:

./Retina -m ./pattern.rgx
./Retina -m -e "foo.*"
./Retina -m ./pattern.rgx ./replacement.rpl
./Retina -m ./pattern.rgx -e "bar"
./Retina -m -e "foo.*" ./replacement.rpl
./Retina -m -e "foo.*" -e "bar"
./Retina -m ./pattern1.rgx ./replacement1.rpl ./pattern2.rgx
./Retina -m ./pattern1.rgx -e "bar" -e "foo*"

Whenever Retina reads a file, it will try to open the file as either UTF-8 or UTF-32. When neither of those is valid, it will open the file as ISO 8859-1 instead. This gives a way to use an encoding where each byte is a single character without having to specify the encoding manually.
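
The encoding fallback can be pictured with a small Python sketch. This is only an illustration of the idea, not Retina's actual code (Retina is a .NET program): the helper name decode_source and the order in which UTF-8 and UTF-32 are tried are assumptions.

```python
def decode_source(raw: bytes) -> str:
    """Decode source bytes with an ISO 8859-1 fallback (hypothetical helper)."""
    for encoding in ("utf-8", "utf-32"):  # assumed order of attempts
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # ISO 8859-1 maps every byte to exactly one character, so this cannot fail.
    return raw.decode("iso-8859-1")
```

With this sketch, the single byte 0xB6 is not valid UTF-8 or UTF-32, so it falls through to ISO 8859-1 and decodes to one character.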

Some notes about single-file programs

  • Since linefeeds separate parts, no part can contain a literal linefeed character. However, Retina will replace all pilcrows (¶) with linefeeds after the file has been split into parts. When using ISO 8859-1 encoding, this gives a single-byte way to include linefeeds in the source code. In the unlikely event that you actually need pilcrows in your code and want to use single-line mode, this substitution can be deactivated with the -P flag. When you're not golfing or you don't want to use ISO 8859-1 encoding anyway, you can also use the escape sequences \n or $n in most places.
  • Retina expects the file to use Unix-style line endings (a single linefeed character, 0x0A). If the file uses Windows-style line endings (\r\n, carriage return 0x0D followed by linefeed), the carriage returns will be contained at the ends of the parts.
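
As a rough Python sketch of the two notes above (not Retina's real implementation; split_parts and its parameter are made-up names), splitting on linefeeds first and substituting pilcrows afterwards looks like this:

```python
def split_parts(source: str, replace_pilcrows: bool = True) -> list[str]:
    """Split a single-file program into parts (hypothetical helper)."""
    # Only 0x0A separates parts, so a carriage return stays inside its part.
    parts = source.split("\n")
    if replace_pilcrows:  # passing False plays the role of the -P flag
        parts = [part.replace("\u00b6", "\n") for part in parts]
    return parts
```

Here split_parts("a\r\nb") yields the parts "a\r" and "b", which shows why Windows-style line endings leave a stray carriage return at the end of a part.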

Stage part syntax

The first (and sometimes only) part of a stage contains the regex to be used. However, if the part contains at least one backtick (`), the code before the first backtick will be used to configure the stage; this is called the configuration string. As an example, the stage

_Ss`a.

is configured with _Ss (more on that later) and uses a. as its regex. Any further backticks are simply part of the regex (with one exception, also more on that later). If you want to use backticks in your regex, but do not want to configure the stage, just start the stage with a single backtick (which translates to an empty configuration).
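
A minimal Python sketch of that rule, assuming a hypothetical helper split_config that ignores the one exception mentioned above:

```python
def split_config(part: str) -> tuple[str, str]:
    """Return (configuration string, regex) for the first part of a stage."""
    config, backtick, regex = part.partition("`")
    if not backtick:
        # No backtick at all: the whole part is the regex.
        return "", part
    # Only the first backtick separates; later ones belong to the regex.
    return config, regex
```

For instance, split_config("`a`b") gives an empty configuration and the regex a`b.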

If the stage uses a substitution (currently applies to all Replace stages and some Sort stages), it consists of two parts, where the first is a regex with configuration as described above, and the second is the substitution pattern. You can use all the usual substitution elements like $n, as well as some custom ones (see below). If no second part is supplied (but Replace mode is enforced), the replacement string is assumed to be the empty string.

Configuring stages

Most options that can be specified in the configuration string are single characters, whose order is often (but not always) arbitrary. If multiple conflicting options are used, the latter option will override the former.

Some characters are available in all modes, some only in specific ones. Mode-specific options are denoted by non-alphanumeric characters and are mentioned below when the individual modes are discussed.

Regex modifiers

All regex modifiers available in .NET (through RegexOptions), except Compiled, are available through the configuration string. This means you don't have to use inline modifiers like (?m) in the regex (although you can). All regex modifiers are available through lower case letters. Each letter toggles the modifier (which is off by default):

  • c: Is for CultureInvariant. Quoting MSDN: "Specifies that cultural differences in language is ignored. For more information, see the "Comparison Using the Invariant Culture" section in the Regular Expression Options topic."
  • e: Activates ECMAScript mode and changes the regex flavour. Some of the other modifiers don't work in combination with this.
  • i: Makes the pattern case-insensitive.
  • m: Activates Multiline mode, which makes ^ and $ match the beginning and end of lines, respectively, in addition to the beginning and end of the entire input.
  • n: Activates ExplicitCapture mode. Quoting MSDN: "Specifies that the only valid captures are explicitly named or numbered groups of the form (?<name>…). This allows unnamed parentheses to act as noncapturing groups without the syntactic clumsiness of the expression (?:…). For more information, see the "Explicit Captures Only" section in the Regular Expression Options topic."
  • r: Activates RightToLeft mode. For more information see MSDN.
  • s: Activates Singleline mode, which makes . match newline characters.
  • x: Activates free-spacing mode, in which all whitespace in the pattern is ignored (unless escaped), and single-line comments can be written with #.

Stage type selection

The default mode for a stage depends on whether there are any further parts in the program after the current one: if this part is the last of the program, the stage defaults to a Match stage. Otherwise it defaults to a Replace stage. The following upper-case letters in the configuration string can override these defaults:

  • M: Match
  • R: Replace
  • S: Split
  • G: Grep
  • A: AntiGrep
  • T: Transliteration
  • O: Sort
  • D: Deduplicate

General Options

The following options apply to all modes:

  • +: Apply the current stage in a loop until the output stops changing (either because the transformation becomes idempotent or because the regex no longer matches). This option makes Retina Turing-complete.
  • ; and :: Turn Silent mode on and off, respectively (this determines whether the result of a stage is printed to the standard output stream). By default non-silent stages always print a trailing linefeed. The order of these with respect to + matters. A ; or : before the + configures whether the entire stage does or doesn't print its result. A ; or : after the + configures whether each iteration of the loop does or doesn't print its result.
  • \: Turn off Silent mode and suppress its trailing linefeed. Like ; and : its relative position with respect to + matters.
  • %: Per-line mode. When this option is used, the input is split into lines (around linefeed characters, 0x0A), and the stage is applied to each line individually. The results are then joined back together with linefeeds. If the % appears before a +, that means the file is split into lines, the stage is applied in a loop to each line, and once all loops have terminated the results are joined back together. If the % appears after a +, the splitting and joining steps are done for every single iteration. This makes a difference when the stage inserts additional linefeed characters.
  • *: Dry run. Implicitly turns off Silent mode. After this stage or group completes, restore the string to its previous value. This allows printing strings without having to continue with the string that was just printed.
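
The + and % options can be pictured as two wrappers around a stage. The following Python sketch is only a model of the description above; loop, per_line and strip_pairs are made-up names, and a "stage" is simply a function from string to string.

```python
import re

def loop(stage, text):
    """Model of +: reapply the stage until the output stops changing."""
    while True:
        result = stage(text)
        if result == text:
            return result
        text = result

def per_line(stage, text):
    """Model of %: apply the stage to each line, then rejoin with linefeeds."""
    return "\n".join(stage(line) for line in text.split("\n"))

def strip_pairs(text):
    """Example stage: delete every () pair found in a single pass."""
    return re.sub(r"\(\)", "", text)

# % before +: split once, loop on each line separately, then join.
print(per_line(lambda line: loop(strip_pairs, line), "(())\n()("))
```

Swapping the nesting to loop(lambda t: per_line(strip_pairs, t), text) would model % after +, where the split and join happen on every iteration.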

Limits

Normally each stage tries to find all matches of its regex and then does something with those matches. Limits can be used to restrict which matches should be used for the operation, and sometimes set other restrictions as well. The details of this depend on the type of stage and are listed below. Here, we only consider the syntax of limits.

A limit is simply a signed integer anywhere in the configuration string and refers to some countable concept of the current stage. Some stages support multiple limits, where the limit's position is assigned a fixed meaning (e.g. the first limit concerns matches, the second concerns characters). A 0 always switches off the current limit (which allows you to specify limits in later positions while keeping the first entity unrestricted). Positive limits n set an anchor at the nth entity (counting from 1). Negative limits -n set an anchor at the nth entity from the end (so -1 anchors the last element, -2 the second to last and so on).

By default, if a limit is set, only entities with position less than or equal to the anchor will be processed. That means a limit of 3 might process only the first three matches, or a limit of -2 might process all but the last match. The behaviour of a limit can be changed after the limit was set with one of the following characters:

  • <: Process positions less than or equal to the anchor. (This is the default, and therefore currently a bit useless.)
  • =: Process only the anchor.
  • >: Process positions greater than (and not equal to) the anchor.
  • ~: Process everything except the anchor.

Using any of <=>~ before the first limit is a syntax error.

One more note: leading 0s are treated as separate limits. So the configuration string 04-2>1 would set four limits:

  • The first limit is 0 and therefore unrestricted.
  • The second limit is 4 and processes only the first four entities.
  • The third limit is -2> and processes only the last entity.
  • The fourth limit is 1 and processes only the first entity.
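
The limit syntax and its meaning can be sketched in Python as follows. This is a model of the rules above rather than Retina's parser; parse_limits and selected_positions are hypothetical names, and what happens when an anchor lies outside the available range is not modelled beyond the plain comparison.

```python
import re

# A limit is 0, or a signed integer without leading zeros, plus an optional mode.
LIMIT = re.compile(r"(0|-?[1-9][0-9]*)([<=>~]?)")

def parse_limits(config):
    """Extract (value, mode) pairs from a configuration string."""
    return [(int(value), mode or "<") for value, mode in LIMIT.findall(config)]

def selected_positions(limit, count):
    """Return the 1-based positions that a limit selects out of `count` entities."""
    value, mode = limit
    positions = range(1, count + 1)
    if value == 0:  # 0 switches the limit off
        return list(positions)
    anchor = value if value > 0 else count + 1 + value
    keep = {
        "<": lambda p: p <= anchor,
        "=": lambda p: p == anchor,
        ">": lambda p: p > anchor,
        "~": lambda p: p != anchor,
    }[mode]
    return [p for p in positions if keep(p)]

print(parse_limits("04-2>1"))
```

With five entities, the third limit of that example, (-2, ">"), selects only position 5, i.e. the last entity.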

Grouping stages

Stages can be grouped into compound stages. Grouping itself doesn't do anything, but some options can be applied directly to groups which either allows setting of options for multiple stages at once, or allows for completely new control flow.

If a stage's configuration contains one or more (, that stage is the beginning of that many groups. Likewise, if a stage is configured with one or more ), the stage is the end of that many groups. ( and ) cannot both appear on the same stage. If any parentheses are unmatched in the program, they are implicitly closed at the first and last stage of the program, respectively. As an example, consider the following program:

stage1
)`stage2
((`stage3
)`stage4
((`stage5
)`stage6
stage7

In more conventional syntax, the groups would look like this:

(stage1, stage2), ((stage3, stage4), ((stage5, stage6), stage7))
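
The implicit closing rules can be checked with a short Python sketch. It assumes every stage is a single line and ignores all configuration other than parentheses; group_stages is a made-up helper, not part of Retina.

```python
def group_stages(lines):
    """Build nested lists of stage bodies from ( and ) in the configurations."""
    root = []
    stack = [root]
    for line in lines:
        config, backtick, body = line.partition("`")
        if not backtick:
            config, body = "", line
        for _ in range(config.count("(")):
            group = []
            stack[-1].append(group)
            stack.append(group)
        stack[-1].append(body)
        for _ in range(config.count(")")):
            if len(stack) > 1:
                stack.pop()
            else:
                # Unmatched ): the group is implicitly opened at the first stage.
                root[:] = [list(root)]
    # Groups still open here are implicitly closed at the last stage.
    return root

program = ["stage1", ")`stage2", "((`stage3", ")`stage4",
           "((`stage5", ")`stage6", "stage7"]
print(group_stages(program))
```

The printed nesting corresponds to the conventional notation shown above, with Python lists standing in for the parentheses.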

A group can be configured on either of its parentheses (this is to ensure that the group is always configurable even if one parenthesis is implied). The configuration always goes left of the group's parenthesis, so that the actual stage configuration is to the right of all parentheses. Example:

outerConfig(innerConfig(stageConfig`stage...
innerConfig)outerConfig)stageConfig`stage...

The following options can be applied to groups:

  • Regex modifiers: Remember that these options are toggles, so they toggle the modifier for all stages inside the group, and can then be switched off for individual subgroups or stages by repeating the modifier on those.
  • Loops: Groups can also be looped with + which repeats the entire group until the string remains unchanged during a full iteration.
  • Silent mode: ;, : and \ can be applied to groups, which affect whether anything is printed (and how) once the entire group is completed. This is independent of whether the last stage in the group prints or not. The above comments regarding the relative order with + still apply.
  • Per-line mode: Groups can be applied to each line separately with %. The above comments regarding the relative order with + still apply.
  • Dry run: * can be applied to groups in the same way it's applied to individual stages, and the string from before the group will be restored at the end.

Finally, since the most common use case of groups is building compound loops, there are two shorthands, { and }. These are short for +( and +), respectively.

Stage types

As outlined above, Retina currently supports 8 different stage types: Match, Replace, Split, Grep, AntiGrep, Transliteration, Sort, Deduplication.

Match

This mode takes the regex with its modifiers and applies it to the input. By default, the result is the number of matches.

Match mode currently supports the following options:

  • !: Instead of the number of matches, the result is a linefeed-separated list of all matches.
  • &: Consider overlapping matches. Normally, the regex engine will start looking for the next match after the end of the previous match. With this mode, Retina will instead look for the next match after the beginning of the previous match. Note that this will still not take into account overlapping matches which start at the same position (but end at different positions).
  • @: Count or print only unique matches (in order of first appearance). The unique matches are determined after applying any limits (see below).

Match mode currently supports two limits:

  • Limit 1: Determines which matches to consider. With the ! option, this means which matches are included in the resulting list. Without the option, only matches in the restricted range are counted. That means applying a < limit sets a maximum on the output. A > limit subtracts the limit from the output (down to a minimum of 0), etc.
  • Limit 2: This limit only applies to the ! and @ options and determines which characters of the match to include in the output. So a second limit of 3 would only include the first three characters in every match, and a limit of 5= would only include the fifth character in every match if it exists.
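
To make the !, & and @ options concrete, here is a Python sketch of a Match stage without limits. It uses Python's re module, whose regex flavour differs from .NET's, so it only models the behaviour for simple patterns; match_stage and its parameters are made-up names.

```python
import re

def match_stage(pattern, text, list_matches=False, overlapping=False, unique=False):
    """Model of a Match stage: count matches, or list them (the ! option)."""
    regex = re.compile(pattern)
    matches, pos = [], 0
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        matches.append(m.group())
        if overlapping:
            # & option: resume just after the start of the previous match.
            pos = m.start() + 1
        else:
            # Default: resume at the end, stepping past empty matches.
            pos = m.end() if m.end() > m.start() else m.end() + 1
    if unique:
        # @ option: keep the first occurrence of each distinct match.
        matches = list(dict.fromkeys(matches))
    return "\n".join(matches) if list_matches else str(len(matches))
```

For the input aaaa and the regex aa, this gives 2 by default and 3 with overlapping=True.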

Replace

Replace stages do what it says on the tin: they replace all matches of the regex in the input with the substitution pattern, and return the resulting string. Replace mode doesn't have any specific options yet, but comes with one limit:

  • Limit 1: Determines which matches to replace. Other matches are left unchanged.

Replace mode does not use .NET's built-in Regex.Replace, but a custom implementation instead. All of the substitution elements $... that are valid in .NET also work in Replace mode (see MSDN for details). However, Retina understands the following additional substitution elements:

  • $n: Is an escape sequence for a linefeed character (0x0A), much like \n inside a regex.
  • $#1, $#{foo}: By inserting # into a group reference, the number of captures made by that group is included (instead of the value of the last capture). This also works with $#+. This creates a simple way to count things and convert unary to decimal.
  • $.1, $.{foo}: By inserting . into a group reference, the length of the (most recent) capture is included (instead of the value). This also works with the substitution elements $.`, $.', $._, $.& and $.+.
  • $%_, $%`, $%': Like $_, $`, $', respectively, but the result is cut off at the closest linefeed before and/or after the match. . and % can be combined in that order. For example, $%_ inserts the current line or $.%` inserts the (0-based) horizontal position of the match. Note that $%_ may span several lines if the match itself includes one or more linefeeds. E.g. if the match is itself a linefeed, $%_ inserts the lines before and after it. Or if a full line is matched including its trailing linefeed, then $%' inserts the next line (without its trailing linefeed).
  • $*_: _ can be replaced with any character. This repeats that character n times where n is the first decimal number in the result of the preceding token. Literal integers are treated as single tokens. Otherwise, each character in a literal is treated as a separate token. This creates a simple way to convert decimal to unary, e.g. by replacing \d+ with $&$*1. If there is no preceding token, $& is implied. If there is no character after $*, 1 is implied. So the previous example can be shortened to replacing \d+ with $*.
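
For readers more familiar with other languages, the decimal/unary conversions mentioned above behave roughly like the following Python functions. These are analogies built on re.sub, not Retina code: to_unary mimics replacing \d+ with $*, and to_decimal mimics replacing 1+ with $.& (the length of the match).

```python
import re

def to_unary(text):
    """Like replacing \\d+ with $*: each number n becomes n copies of 1."""
    return re.sub(r"\d+", lambda m: "1" * int(m.group()), text)

def to_decimal(text):
    """Like replacing 1+ with $.&: each run of 1s becomes its length."""
    return re.sub(r"1+", lambda m: str(len(m.group())), text)

print(to_unary("3 5"))
print(to_decimal("111 11111"))
```

Note that a 0 in the input turns into an empty string in unary, so the round trip does not restore it.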

Split

This passes the regex on to a custom implementation of Regex.Split (that is actually less buggy than the built-in). The result of Regex.Split separated by linefeeds is the result of this stage. By default, captured strings will be included (in the order of their group number) in the output list, in the position where the match was.

Split stages support the following options:

  • _: Filter all empty chunks out of the result before printing. This does not apply to groups which successfully captured an empty string.
  • -: Omit captures from the list. Make sure this isn't followed immediately by a digit or it will be treated as part of a limit.

Split supports two limits:

  • Limit 1: Determines which matches to take into account for splitting.
  • Limit 2: Determines which group numbers to include for each match, when not using the - option. The maximum group number (relevant for negative limits) is the largest group number present in the pattern, even if that group is unused. Note that this might be larger than the number of groups if there are gaps in the group numbers due to using manual group numbers.
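
A Python sketch of a Split stage without limits may help to see how captures and the _ and - options interact. It is a model only: split_stage is a made-up name, and skipping groups that did not take part in the match is an assumption, since the text above does not say how those are treated.

```python
import re

def split_stage(pattern, text, omit_empty=False, omit_captures=False):
    """Model of Split: chunks between matches, interleaved with captures."""
    chunks, last = [], 0  # each entry is (string, is_capture)
    for m in re.finditer(pattern, text):
        chunks.append((text[last:m.start()], False))
        if not omit_captures:  # omit_captures models the - option
            # Assumption: groups that did not participate are skipped.
            chunks.extend((g, True) for g in m.groups() if g is not None)
        last = m.end()
    chunks.append((text[last:], False))
    if omit_empty:  # models the _ option; captured empty strings survive
        chunks = [(s, cap) for s, cap in chunks if cap or s != ""]
    return "\n".join(s for s, _ in chunks)
```

Splitting a,b,,c around (,) yields seven chunks including one empty chunk; with both options switched on only a, b and c remain.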

Grep and AntiGrep

Grep mode makes Retina assume grep's basic mode of operation: the input is split into individual lines, the regex is matched against each line, and the result consists of all lines that yielded a match.

AntiGrep mode is almost the same, except that it discards all lines which yielded a match.

Both modes come with one limit:

  • Limit 1: For Grep stages, determines which of the matched lines to keep. A limit of 3 keeps the first three, and a limit of 2~ keeps all but the second matched line. For AntiGrep stages, determines which of the matched lines to discard instead. (Hence, the limit always references matches.)
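
The asymmetry of the limit between the two modes is easiest to see in code. The Python sketch below is a model, not Retina's implementation: grep_stage is a made-up name, and instead of parsing a limit it takes the set of selected matched-line positions (1-based) directly.

```python
import re

def grep_stage(pattern, text, anti=False, selected=None):
    """Model of Grep (anti=False) and AntiGrep (anti=True)."""
    kept, seen = [], 0
    for line in text.split("\n"):
        if re.search(pattern, line):
            seen += 1
            chosen = selected is None or seen in selected
            # Grep keeps the chosen matched lines; AntiGrep discards them.
            keep = chosen != anti
        else:
            # Lines without a match are dropped by Grep and kept by AntiGrep.
            keep = anti
        if keep:
            kept.append(line)
    return "\n".join(kept)
```

Passing selected={1, 3} for three matched lines corresponds to the limit 2~ from the text, i.e. all but the second matched line.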

Transliteration

This mode is a rather versatile mode, which is mainly intended for substituting individual characters for others. Its format is slightly different from the other modes, in that it consists of up to four segments on a single line. All of the following are valid Transliterate stages:

config`from
config`from`to
config`from`to`regex

Here, config is just the regular configuration string, which needs to contain T to select this mode. If to is not given or empty it is assumed to be _. If regex is not given or empty it is assumed to be [\s\S]+, which always matches the entire input (unless the input is empty, in which case this stage would be a no-op anyway).

First, both from and to are expanded to lists of characters, using similar rules to character classes in regex:

  • \ escapes the next character. The following escape sequences are known (only for convenience - you can also embed the characters directly in the source code): \a (bell, 0x07), \b (backspace, 0x08), \f (form feed, 0x0C), \n (line feed, 0x0A, if you want to embed this literally, use ¶), \r (carriage return, 0x0D), \t (tab, 0x09), \v (vertical tab, 0x0B). If a \ is followed by any other character, it just escapes that next character (this lets you add, e.g., ` or \ itself to the list). If there is a single \ left at the end of the stage, it represents a literal backslash.
  • If - is preceded and followed by a single character or escape sequence it denotes a range. As opposed to regex, ranges can be both ascending and descending. 0-4 denotes 01234, but d-a denotes dcba. Ranges can also be degenerate, e.g. a-a. Backticks have to be escaped. A hyphen at the beginning or end of the current segment, or one which follows immediately after another range is treated literally. The character classes listed below are ignored when they appear as one end of a range.
  • There are several built-in character classes. As opposed to regex, they do not need a backslash, so be careful when using literal letters in your code:
    • d is equivalent to 0-9.
    • E is equivalent to 02468.
    • O is equivalent to 13579.
    • H is equivalent to 0-9A-F.
    • h is equivalent to 0-9a-f.
    • L is equivalent to A-Z.
    • l is equivalent to a-z.
    • w is equivalent to _0-9A-Za-z.
    • p is equivalent to <sp>-~, where <sp> is a space.
    • The first o inserts the other set. That is, if o is used in from it inserts to. If o is used in to it inserts from. Subsequent occurrences of o are treated as literals. If o is used in both from and to it is treated as a literal everywhere.
  • Preceding any range or built-in character class with R reverses that range (this also works for o). Multiple Rs can be used as well, although that is currently rather useless (an even number of R is a no-op, an odd number reverses the range). While R does work with ranges, it's simpler to just use the opposite range: Ra-z is the same as z-a. R not followed by a range or class is treated as a literal.
  • _ is a special blank. It currently doesn't do anything when used in the from set but it can be used in the to set to delete characters appearing in the from set.

If the expanded to list is shorter than the from list, it will be padded to the same length by repeating its last character. That is, the following specifications are equivalent:

T`d`abc
T`0123456789`abcccccccc

After this preprocessing has been done, the regex is applied to the input string. Parts which aren't matched are left unchanged, but in the matches, every character found in from is replaced by the character at the same position in to. If the corresponding "character" in to is a blank, the character will be deleted instead. Characters not found in from are left unchanged. Remember that to defaults to _, so if no to is given all the characters in from will be deleted.

If a character appears multiple times in from, only its first occurrence is taken into account.

This mode allows simple character transformations which would otherwise require listing every character separately. Some examples:

T`Ll`lL          # Swap the case of all ASCII letters in the input.
T`l`L`\b.        # Capitalise the first letter of each word.
T`L`N-ZA-M       # ROT-13.
T`L`N-ZL         # Also ROT-13.
T`L`RL           # Atbash cypher.
T`L`Ro           # Also Atbash cypher.
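
The following Python sketch models a simplified Transliterate stage without limits, enough to try some of the examples above. It is not Retina's implementation: the function names are made up, only the classes d, L and l are supported, and escapes, o, repeated Rs and the finer hyphen rules are left out. Letting _ occupy a position in the from set is an assumption.

```python
import re
import string

# Only three of the built-in classes are modelled here.
CLASSES = {"d": string.digits, "L": string.ascii_uppercase, "l": string.ascii_lowercase}

def read_token(spec, i):
    """Return (items, next_index, is_set) for the token starting at spec[i]."""
    c = spec[i]
    if i + 2 < len(spec) and spec[i + 1] == "-":
        # Range, ascending or descending; class letters are literal as ends.
        lo, hi = ord(c), ord(spec[i + 2])
        step = 1 if lo <= hi else -1
        return [chr(x) for x in range(lo, hi + step, step)], i + 3, True
    if c in CLASSES:
        return list(CLASSES[c]), i + 1, True
    if c == "_":
        return [None], i + 1, False  # None stands for the blank
    return [c], i + 1, False

def expand(spec):
    """Expand a from/to segment into a list of characters (or None for _)."""
    items, i = [], 0
    while i < len(spec):
        if spec[i] == "R" and i + 1 < len(spec):
            chars, j, is_set = read_token(spec, i + 1)
            if is_set:  # R reverses the range or class that follows it
                items.extend(reversed(chars))
                i = j
                continue
        chars, i, _ = read_token(spec, i)
        items.extend(chars)
    return items

def transliterate(text, from_spec, to_spec="_", pattern=r"[\s\S]+"):
    src, dst = expand(from_spec), expand(to_spec or "_")
    dst += [dst[-1]] * (len(src) - len(dst))  # pad with the last character
    table = {}
    for s, d in zip(src, dst):
        if s is not None:
            table.setdefault(s, d)  # only the first occurrence in from counts
    def replace(match):
        out = []
        for ch in match.group():
            target = table.get(ch, ch)
            if target is not None:  # a blank deletes the character
                out.append(target)
        return "".join(out)
    return re.sub(pattern, replace, text)

print(transliterate("HELLO", "L", "N-ZA-M"))
```

The arguments mirror the segments of a stage, so T`l`L`\b. corresponds to transliterate(text, "l", "L", r"\b.") in this sketch.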

Transliterate mode currently doesn't have any dedicated options, but it supports two limits:

  • Limit 1: Determines which matches to process. All matches outside of the constrained range are left unchanged.
  • Limit 2: Determines which characters in each match to transliterate.

Sort

As the name implies, this stage can be used to sort strings. This stage can use either one or two parts (that is, a substitution pattern is optional); by default it uses no substitution and only needs a single part like most other stage types. The regex itself can also be omitted and in that case defaults to .+. The input is processed as follows:

  • The regex is applied to the input, which, much like in a Split stage, is separated into alternating strings which are either delimiters or matches.
  • The matches are then sorted (by default using standard lexicographic string sorting by their code points).
  • The sorted matches are then interleaved with the delimiters again.

As a simple example, the following stage sorts all the non-empty lines in the input:

O`.+

As does this one:

O`

Whereas this stage would sort all the characters in the string by their code points:

Os`.

Options can be used for even more elaborate sorting tasks:

  • #: Sort numerically instead. In each match, Retina looks for the first (signed) integer, and sorts the strings by the value of that integer. Matches that don't contain any digits will be treated as value 0. Sorting in Retina is stable.
  • $: This is essentially "sort by". With this option, the stage expects a substitution pattern on the next line. Matches will be sorted by the result of that substitution. Again, this sorting is stable.
  • ^: This reverses the matches after they have been sorted. Note that this isn't quite the same as sorting in descending order when combined with # or $ (or both), since the order of matches that yield the same key will be reversed.
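
The three processing steps and the # and ^ options can be modelled in a few lines of Python. This is a sketch under assumptions, not Retina's code: sort_stage is a made-up name, the $ option and limits are left out, and "signed integer" is taken to mean an optional leading minus sign.

```python
import re

def sort_stage(pattern, text, numeric=False, reverse=False):
    """Model of Sort: sort the matches, keep the delimiters in place."""
    matches = list(re.finditer(pattern or ".+", text))

    def key(s):
        if not numeric:
            return s  # Python compares strings by code point
        m = re.search(r"-?\d+", s)  # assumed form of a signed integer
        return int(m.group()) if m else 0

    ordered = sorted((m.group() for m in matches), key=key)  # stable sort
    if reverse:
        ordered.reverse()  # ^ reverses after sorting, ties included
    out, last = [], 0
    for m, replacement in zip(matches, ordered):
        out.append(text[last:m.start()])  # delimiter before this match
        out.append(replacement)
        last = m.end()
    out.append(text[last:])
    return "".join(out)

print(sort_stage(".+", "b\n\na\nc"))
```

Because the delimiters stay where they are, the empty line in that input remains in second position while the non-empty lines around it are reordered.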

Sort stages currently support one limit:

  • Limit 1: Determines which matches should be sorted. All matches that don't fall in this range are ignored and simply part of the delimiters around the matches.

Deduplication

This stage can be used to remove duplicates from the input. Like sorting, this stage can use either one or two parts (i.e. it has an optional substitution). Also like sorting, the regex of this stage can be omitted and defaults to .+. Without any options, this simply finds all matches in the string and then removes those matches that are equal to an earlier match. Note that when used with the r modifier (right-to-left matching), matches are processed from the end, so the last occurrence of every substring will be kept.

Deduplication stages currently have only one dedicated option:

  • $: This is essentially "deduplicate by". With this option, the stage expects a substitution pattern on the next line. Equality of matches will be determined based on the result of that substitution.

Deduplication stages currently support one limit:

  • Limit 1: Determines which matches should be considered for deduplication. All matches that don't fall in this range are ignored and will neither be removed, nor be checked for equality with other matches.
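
A Python sketch of the default behaviour, without the $ option or limits. The names are made up, and the keep_last flag only approximates the effect of the r modifier by walking the matches from the end, since Python's re module has no right-to-left matching.

```python
import re

def deduplicate(pattern, text, keep_last=False):
    """Model of Deduplicate: remove matches equal to one processed earlier."""
    matches = list(re.finditer(pattern or ".+", text))
    order = reversed(matches) if keep_last else matches
    seen, dropped = set(), set()
    for m in order:
        if m.group() in seen:
            dropped.add(m.span())
        else:
            seen.add(m.group())
    out, last = [], 0
    for m in matches:
        out.append(text[last:m.start()])  # text between matches is kept
        if m.span() not in dropped:
            out.append(m.group())
        last = m.end()
    out.append(text[last:])
    return "".join(out)
```

Only the matches themselves are removed, so with the default regex .+ the linefeed before a removed line stays in the output.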