The Language

Martin Ender edited this page Mar 29, 2020 · 25 revisions

Retina is a regex-based recreational programming language. Every program works by reading a (finite) string from standard input, transforming it via a series of regex operations (e.g. counting matches, filtering lines, and most of all substituting). Retina was built on top of .NET's regex engine, but provides its own, more powerful substitution syntax.

Retina was mainly developed for code golf, which may explain its very terse configuration syntax and some weird design decisions.

This document provides complete documentation of all language features and is quite a behemoth for a recreational programming language. Some annotated example programs and a tutorial can be found in the Examples folder of the repository.

Basics

Retina reads a single source file whose name is provided as a command-line argument:

./Retina program.ret

Retina will attempt to open the file as either UTF-8 or UTF-32. If neither of those is valid, it will open the file as ISO 8859-1 instead. This lets you use an encoding where each byte is a single character, and which includes Retina's most important non-ASCII character, the pilcrow ¶.

The program file is first split into lines, where we will call each line a source. Then all pilcrows are replaced with linefeed characters (0x0A), in order to make it possible to include linefeeds within the individual sources. Therefore, I will usually write ¶ to refer to a linefeed in the remainder of the documentation. How these sources are used to form a program will be described in the next section.

An important note on line endings: Retina expects the program file to use Unix-style line endings (a single linefeed character, 0x0A). If the file uses Windows-style line endings (\r\n, carriage return, 0x0D, followed by a linefeed), the carriage returns will remain at the end of each source, which you usually don't want.
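The loading rules above can be sketched in a few lines of Python. This is only an illustration, not Retina's actual implementation: the name load_sources is made up for this example, byte-order-mark handling is left to Python's codecs, and treating a trailing linefeed as producing an empty final source is an assumption the text does not confirm.

```python
PILCROW = "\u00b6"  # the pilcrow character

def load_sources(raw: bytes) -> list:
    """Decode a program file and split it into sources (hypothetical helper)."""
    text = None
    # Try UTF-8 first, then UTF-32, as described above.
    for encoding in ("utf-8", "utf-32"):
        try:
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    if text is None:
        # ISO 8859-1 maps every byte to one character, so this cannot fail.
        text = raw.decode("iso-8859-1")
    # Split on linefeeds only, then turn pilcrows into real linefeeds.
    return [line.replace(PILCROW, "\n") for line in text.split("\n")]

# With Windows-style line endings the carriage return stays in the source:
print(load_sources(b"x\r\ny"))  # ['x\r', 'y']
```

Note how the split happens before the pilcrow replacement, which is what allows a single source to contain linefeeds.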

To provide input to the program, you need to redirect it to its standard input stream:

echo "some input" | ./Retina program.ret

or

./Retina program.ret < input_file

Retina will not read input interactively from the console. If no input stream is provided, Retina will assume that the input is an empty string. The input stream has to be finite, because Retina reads all input before beginning to process it.

The result of the program as well as any potential intermediate outputs are printed to the standard output stream.

Program structure

A Retina program is made up of a number of stages. There are three (or two and a half) general types of stages:

  • Atomic stages: These are the basic operations the program performs. Each atomic stage is defined by one or more contiguous sources. The number of sources required for each atomic stage depends on the type of stage and its configuration.

    What all atomic stages have in common is that the first source contains the regex used to carry out the operation. This regex can optionally be preceded by a stage configuration, separated with a backtick, `. If the first source of a stage contains at least one backtick, Retina will attempt to read valid configuration options from the beginning of the source until it encounters an individual backtick. The remainder of the source is then used as the regex. Note that it's possible for the configuration to contain backticks, because there are some multi-character options which can contain them. The general format of an atomic stage is therefore:

    config`regex
    further
    sources
    ...
    

    where anything after the first line depends on config. See the section on match behaviour options for details on the number of sources required for a stage.

    If you want to use backticks in your regex, but do not want to configure the stage, just start the stage with a single backtick (which translates to an empty configuration).

  • Constant stages: This is technically a type of atomic stage, but it's unique in that it doesn't contain a regex and always consists of a single line. More importantly, a lot of the discussion (and possible configurations) on atomic stages below doesn't apply to constant stages, precisely because they don't contain a regex. The general format is simply:

    config`constant
    
  • Compound stages: These are stages which contain one or more child stages (which may be atomic/constant stages or other compound stages themselves), and modify the way these operate in some way. Hence, the overall structure of a Retina program is a tree of stages with atomic/constant stages as leaves and compound stages as non-leaf nodes.

    Compound stages mostly provide control flow and side effects. They are introduced by special non-alphanumeric characters within stage configurations. The way compound stage syntax works is a bit elaborate (and probably confusing at first), so discussion of it is deferred to its own section.

When the sources have been parsed into stages, we potentially have a list of stages, not a single stage (as there is no guarantee that all stages have been grouped into a single compound stage). To ensure that the program really is just a single tree of stages, this list of stages is implicitly wrapped in a group stage (a type of compound stage).

Implicit output

Retina will normally print the final result of the program at the end of the program. This feature can be disabled with the silent option, ., inside any configuration anywhere in the program.

For the rare case where this is relevant, here is the exact specification of this feature. This part assumes some knowledge about output stages from later sections of the documentation, and can be safely skipped on a first read.

Before implicitly wrapping the top level of stages in a group stage, Retina looks at the final stage in that list. If this stage is not an output stage (another type of compound stage), or it's an output stage with the "pre-print" option active, that final stage gets implicitly wrapped in an output stage. In particular, this means that the implicit output stage can be replaced by an explicit output stage which can then be configured. Here are some examples using a simple to-upper-case program T`l`L (each line should be thought of as a separate program):

T`l`L           # Implicitly prints the upper-cased result.
>T`l`L          # Does the same thing, but the output has been made explicit.
>>T`l`L         # Prints the result twice (one > replaces the implicit one, the
                # other is extra).
\T`l`L          # Prints the result with a trailing linefeed.
;T`l`L          # Prints the result only if any characters were changed.
<T`l`L          # This output stage uses the "pre-print" option, so it doesn't
                # replace the implicit output. Therefore, this program prints
                # both the input and the upper-cased result.
&>T`l`L         # Here, the explicit output stage is itself wrapped in a non-output
                # stage (which does nothing in this case), so the implicit output
                # remains and we get two copies of the result.

Configuring stages

Configurations can get really tricky once compound stages are involved, so we will start by looking at ways to configure atomic stages, then talk about compound stages with a single child and finally about group stages (which are compound stages with multiple children).

As a general rule, if there are multiple mutually exclusive options of the same type, the latter will override the former.

Global flags

Let's start with the simplest type of option: global flags. These can appear in any configuration, anywhere in the program, and modify some part of Retina's overall behaviour.

The descriptions assume knowledge of the feature the flag affects, but will link to the relevant sections of the documentation.

  • .: Silent. This disables Retina's implicit output at the end of the program.

  • !#n: Log limit. n is a positive integer, but can be omitted for zero (i.e. !# and !#0 are identical). This limits the size of the running result log, so that only the last n results are retained. Using !# lets you switch off the result log completely (while keeping the stage history active).

  • !.: By default, most stages register with the history to record their results. This flag toggles this, so that stages from this point onward will not register by default. Output stages and unconfigured dry run stages still won't register. Using !. again switches the default back on (so you can bracket a section that you don't want to register with the history with two uses of !.). Individual stages can be toggled on or off with the !, option.

    Note that !. is an indication for the parser and is processed before any stages in the current source are created. In particular, the following two stages are equivalent, and the flag affects both the constant stage and the compound per-line stage:

    !.%K`foo
    
    %!.K`foo
    

Stage type

The type of an atomic stage determines what operation this stage performs. The default type of a stage depends on how many sources there are left in the program: if the stage's first source is the program's final source, the stage defaults to a Count stage. Otherwise, it defaults to a Replace stage (which, by default, consists of two sources). So the following program contains two Replace stages (from a to b and from c to d) and one Count stage (counting es):

a
b
c
d
e
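To make the default rule concrete, here is a small Python sketch. It assumes that no source carries any configuration (so every Replace stage takes exactly two sources); default_stage_types is a name invented for this illustration.

```python
def default_stage_types(sources):
    """Group unconfigured sources into stages using the default types."""
    stages = []
    i = 0
    while i < len(sources):
        if i == len(sources) - 1:
            # The stage's first source is the program's final source: Count.
            stages.append(("Count", sources[i:]))
            i += 1
        else:
            # Otherwise: Replace, which takes a regex and a substitution.
            stages.append(("Replace", sources[i:i + 2]))
            i += 2
    return stages

print(default_stage_types(["a", "b", "c", "d", "e"]))
# [('Replace', ['a', 'b']), ('Replace', ['c', 'd']), ('Count', ['e'])]
```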

These defaults can be overridden by listing an explicit stage type in a stage's configuration. All stage types are upper-case letters:

  • A: AntiGrep
  • C: Count
  • D: Deduplicate
  • G: Grep
  • I: Positions
  • K: Constant
  • L: List
  • N: Sort (numeric)
  • O: Sort (alphabetical)
  • P: Pad
  • R: Replace
  • S: Split
  • T: Transliteration (regular)
  • V: Reverse
  • Y: Transliteration (cyclic)

Note that N and O are technically the same type of stage but with a certain flag set differently, so they will be discussed as a single stage type below. The same applies to T and Y.

Match behaviour options

These options are available for all atomic stages and affect a) how many sources are required to form the current stage and b) how the atomic stage preprocesses its input. This preprocessing is uniform across all atomic stage types. More on this later when we talk about how Retina programs are executed.

  • $: Use a substitution on the matches. By default, atomic stages consist only of a single source containing the stage's regex (and optional configuration). With this option, the stage will also require a source for the substitution pattern. This option is always on for Replace stages (which is why they consist of two sources by default).
  • =: Swap matches and separators. What exactly this means will be explained in a later section.
  • #n: Multi-regex stage. n is a non-zero (but possibly negative) integer. The integer's magnitude indicates how many regexes the stage is using. Each regex goes in a separate source, of course. The integer's sign indicates how the multiple regexes should be used (more on that later).
  • :: Input as regex. This swaps the input and the regex source. This option can't be used in conjunction with #n. If this option is used, the stage's input will be used as its regex and the stage's first line indicates which string this regex should process. This provides a limited form of executing dynamic programs in Retina.
  • @: Keep a single random match. A purely behavioural option, which we'll go into later when looking at the preprocessing of atomic stages. This option is also available for the compound per-line (%) and match mask (_) stages.

There is some non-trivial interaction between the $, = and #n options, in terms of how many sources make up the current stage, so we'll go through the relevant combinations here. Since $ is implicit for Replace stages, we'll be using L stages for the examples.

A stage using only the $ option consists of two sources:

L$`regex
substitution

A stage using only the #n option consists of exactly |n| sources, one for each regex:

L#-4`regex1
regex2
regex3
regex4

A stage using both #n and $ has a separate substitution for each regex, so it requires twice as many sources:

L#3$`regex1
substitution1
regex2
substitution2
regex3
substitution3

And finally, a stage using all of #n, $ and = only has a single substitution (for the separators), which goes at the end, so it requires |n|+1 sources:

L#3$=`regex1
regex2
regex3
separatorSubstitution

In general, Retina expects the program to contain a sufficient number of sources to complete the current stage. However, if the program is one source short, and that source would be used as a substitution, it defaults to an empty string.
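The combinations above boil down to a small formula, sketched here in Python. The function name sources_required is hypothetical, and the case of = without $ (treated here as needing no extra source) is my assumption, since the text doesn't spell it out.

```python
def sources_required(n=1, substitution=False, swap=False):
    """Number of sources for a stage with #n, $ (substitution) and = (swap)."""
    if n == 0:
        raise ValueError("n must be non-zero")
    regexes = abs(n)  # only the magnitude of n affects the source count
    if not substitution:
        return regexes          # one source per regex
    if swap:
        return regexes + 1      # single separator substitution at the end
    return 2 * regexes          # one substitution per regex

print(sources_required(substitution=True))                 # L$      -> 2
print(sources_required(n=-4))                              # L#-4    -> 4
print(sources_required(n=3, substitution=True))            # L#3$    -> 6
print(sources_required(n=3, substitution=True, swap=True)) # L#3$=   -> 4
```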

Stage-specific options

Certain options are specific to a single type of stage. These are always two-character options, a ! followed by another character, and are listed below with the specifications of the individual stage types.

Regex modifiers

Regex modifiers are written as lower-case letters. There are two types of regex modifiers, "standard" and "custom".

The standard modifiers are just those most people are familiar with from various regex flavours, and are exactly those provided by .NET (through RegexOptions), except Compiled.

Custom modifiers are really just Retina-specific options, but the way they affect how Retina processes regexes and substitutions is somewhat similar to the standard modifiers, which is why they share the lower-case letters. However, there are a few differences between the two types of modifiers.

For standard modifiers, each letter toggles the modifier (which is off by default):

  • c: Is for CultureInvariant. Quoting MSDN: "Specifies that cultural differences in language is ignored. For more information, see the "Comparison Using the Invariant Culture" section in the Regular Expression Options topic."
  • e: Activates ECMAScript mode and changes the regex flavour. Some of the other modifiers don't work in combination with this.
  • i: Makes the regex case-insensitive.
  • m: Activates Multiline mode, which makes ^ and $ match the beginning and end of lines, respectively, in addition to the beginning and end of the entire input.
  • n: Activates ExplicitCapture mode. Quoting MSDN: "Specifies that the only valid captures are explicitly named or numbered groups of the form (?<name>…). This allows unnamed parentheses to act as noncapturing groups without the syntactic clumsiness of the expression (?:…). For more information, see the "Explicit Captures Only" section in the Regular Expression Options topic."
  • r: Activates RightToLeft mode. For more information see MSDN.
  • s: Activates Singleline mode, which makes . match newline characters.
  • x: Activates free-spacing mode, in which all whitespace in the regex is ignored (unless escaped), and single-line comments can be written with #.

For custom modifiers, each letter sets the modifier. The detailed semantics of these modifiers are explained in the sections on atomic stage preprocessing and substitution syntax:

  • a: Anchors the regex to the entire input string.
  • l: Anchors the regex to line ends.
  • p: Unique matches, keep last.
  • q: Unique matches, keep first.
  • v: Simple overlaps.
  • w: All overlaps.
  • y: Cyclic match adjacency. (Not to be confused with cyclic multi-regex stages, using the #n option.)

a and l are mutually exclusive and override each other. The same applies to p/q and v/w.

When using a modifier on multi-regex stages (i.e. those with the #n option), the modifier applies to all regexes. If you want to apply a standard modifier only to some of the regexes in a multi-regex stage, you can use their inline versions like (?s) or (?s:...) inside the regex.

Limits

Limits let you filter certain "lists" of things (these things might be matches, characters, stage results, etc.). Stages can be configured by multiple limits, just list them one after the other. The exact meaning of the first, second, etc. limit depends on the specific stage type and will be listed below.

There are three types of limits (exact, range and step), each of which can additionally be inverted.

Exact limits

A single integer n is an exact limit and selects only the element at index n in the list. The indices are zero-based and negative indices can be used to count from the end. If the index is out of bounds for the list in question, the result is simply an empty list.

Examples for applying an exact limit to the string "abcdefghij":

Limit       Result
0           "a"
3           "d"
-2          "i"
10          ""
-11         ""
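For illustration, an exact limit behaves like the following Python sketch (exact_limit is a made-up name; the function works on strings and lists alike):

```python
def exact_limit(items, n):
    """Select only the element at zero-based index n (negative counts from the end)."""
    if n < 0:
        n += len(items)
    # Out-of-bounds indices give an empty result instead of an error.
    return items[n:n + 1] if 0 <= n < len(items) else items[:0]

for n in (0, 3, -2, 10, -11):
    print(n, repr(exact_limit("abcdefghij", n)))
```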

Range limits

Two integers separated by a comma, m,n, are a range limit and select all elements from index m to n, inclusive. If the element indicated by m would be to the right of the element indicated by n, an empty list is returned. Both m and n are optional and default to 0 and -1, respectively.

Examples for applying a range limit to the string "abcdefghij":

Limit       Result
3,5         "def"
-5,-2       "fghi"
1,-2        "bcdefghi"
-13,15      "abcdefghij"
5,3         ""
-2,1        ""
5,-5        "f"
-5,5        "f"
,3          "abcd"
-3,         "hij"
,           "abcdefghij"
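A range limit can be modelled along these lines in Python. The clamping of out-of-bounds endpoints is inferred from the -13,15 row of the table rather than from an explicit rule, so treat it as an assumption; range_limit is a hypothetical name.

```python
def range_limit(items, m=0, n=-1):
    """Select all elements from index m to index n, inclusive."""
    size = len(items)
    lo = m + size if m < 0 else m   # resolve negative indices
    hi = n + size if n < 0 else n
    lo = max(lo, 0)                 # clamp to the list (assumed behaviour)
    hi = min(hi, size - 1)
    return items[lo:hi + 1] if lo <= hi else items[:0]

print(range_limit("abcdefghij", -5, -2))  # fghi
print(range_limit("abcdefghij", 5, 3))    # prints an empty line
print(range_limit("abcdefghij", n=3))     # abcd (m omitted, defaults to 0)
```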

Step limits

Three integers separated by commas, m,k,n, are a step limit and select every kth element from index m to n, inclusive. If k is negative, the limit selects every -kth element from n instead of m. If k has the special value 0, this means that the limit selects only the values at indices m and n, or in other words the step size is n-m after resolving m and n to non-negative indices. k is optional and defaults to 0. m and n are still optional and default to 0 and -1, respectively.

Examples for applying a step limit to the string "abcdefghij":

Limit       Result
1,2,-1      "bdfhj"
1,3,-2      "beh"
1,-3,-2     "cfi"
1,0,-2      "bi"
1,,-2       "bi"
,4,         "aei"
,-4,        "bfj"
,,          "aj"
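Here is a Python sketch of the step limit that reproduces the table above. The name step_limit is invented, and two details are guesses on my part: indices outside the list are simply dropped, and m to the right of n yields an empty result (as it does for range limits).

```python
def step_limit(items, m=0, k=0, n=-1):
    """Select every k-th element between indices m and n, inclusive."""
    size = len(items)
    lo = m + size if m < 0 else m   # resolve negative indices
    hi = n + size if n < 0 else n
    if lo > hi:
        return []
    if k == 0:
        picked = [lo] if lo == hi else [lo, hi]   # only the two endpoints
    elif k > 0:
        picked = list(range(lo, hi + 1, k))       # count forwards from m
    else:
        picked = sorted(range(hi, lo - 1, k))     # count backwards from n
    return [items[i] for i in picked if 0 <= i < size]

print("".join(step_limit("abcdefghij", 1, 3, -2)))   # beh
print("".join(step_limit("abcdefghij", 1, -3, -2)))  # cfi
```

The only difference between positive and negative k is which end the counting starts from; the selected elements are still returned in their original order.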

Inverse limits

Any limit can be inverted by prepending ^ to it. Doing so selects all elements not selected by the regular limit.

Examples for applying an inverse limit to the string "abcdefghij":

Limit       Result
^0          "bcdefghij"
^-2         "abcdefghj"
^10         "abcdefghij"
^3,5        "abcghij"
^,          ""
^1,2,-1     "acegi"
^1,3,-2     "acdfgij"
^1,-3,-2    "abdeghj"
^1,,-2      "acdefghj"
^,,         "bcdefghi"
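Inversion is just the complement of whatever indices the regular limit selects. A minimal Python sketch, where invert_limit is a hypothetical helper that takes the already-resolved indices of the regular limit:

```python
def invert_limit(items, selected):
    """Keep exactly those elements whose index the regular limit did not select."""
    chosen = set(selected)
    return [x for i, x in enumerate(items) if i not in chosen]

s = "abcdefghij"
print("".join(invert_limit(s, [3, 4, 5])))  # ^3,5    -> abcghij
print("".join(invert_limit(s, [1, 4, 7])))  # ^1,3,-2 -> acdfgij
print("".join(invert_limit(s, [])))         # ^10     -> abcdefghij
```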

String options

Some stages take a string option. There are three ways to specify such a string. In general, you surround the string with double quotes:

"a string option"

To include double quotes inside the string itself, you double them. To represent the string value say "Hello, World!" please, you would use the following option:

"say ""Hello, World!"" please"

If the string option only contains a single character, you can instead just prefix that character with a single quote. Therefore, the following two string options are equivalent:

"!"
'!

Furthermore, if that character happens to be a linefeed, you don't even need that single quote, so these three string options are also all equivalent:

"¶"
'¶
¶

How string options are used depends on the specific type of stage they are used on.
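The three notations can be read with a short parser like the following Python sketch. It is not Retina's parser: parse_string_option is a made-up name, and the sketch assumes pilcrows have already been turned into linefeeds, so ¶ shows up as "\n".

```python
def parse_string_option(text):
    """Read one string option from the start of text; return (value, rest)."""
    if text.startswith('"'):
        value = []
        i = 1
        while i < len(text):
            if text[i] != '"':
                value.append(text[i])
                i += 1
            elif text[i + 1:i + 2] == '"':
                value.append('"')        # doubled quote -> literal quote
                i += 2
            else:
                return "".join(value), text[i + 1:]   # closing quote
        raise ValueError("unterminated string option")
    if text.startswith("'") and len(text) > 1:
        return text[1], text[2:]         # single-character form
    if text.startswith("\n"):
        return "\n", text[1:]            # bare linefeed form
    raise ValueError("not a string option")

print(parse_string_option('"say ""Hello, World!"" please"')[0])
# say "Hello, World!" please
```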

Regex options

Some stages also take a regex as an option. This is usually an alternative to a string option, so you'll rarely need both. This regex is also independent of an atomic stage's main regex (in fact, most stages with a regex option are compound stages). Regex options are surrounded in forward slashes. To include forward slashes inside the regex, escape them with a backslash. You can also append any of the standard regex modifiers to the regex option and they'll apply to this regex. Custom regex modifiers are not available. Example:

/[a-z]\/\d/ic

This represents the regex [a-z]/\d with the case-insensitive and culture-invariant regex modifiers.

If your configuration contains both a regex option and regex modifiers meant for the stage itself, make sure not to put them immediately after the regex option (either put them in front, or put some character that isn't a lower-case letter in between).
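As a sketch of how such an option could be read, here is a small Python function. The name parse_regex_option is hypothetical, and stopping at the first character that isn't one of the standard modifier letters is an assumption; the text only says that lower-case letters directly after the closing slash are taken as modifiers of the regex option.

```python
STANDARD_MODIFIERS = "ceimnrsx"

def parse_regex_option(text):
    """Read /regex/modifiers from the start of text; return (regex, modifiers, rest)."""
    if not text.startswith("/"):
        raise ValueError("not a regex option")
    pattern = []
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            # \/ stands for a literal slash; other escapes are kept verbatim.
            pattern.append("/" if nxt == "/" else ch + nxt)
            i += 2
        elif ch == "/":
            break                        # unescaped slash closes the option
        else:
            pattern.append(ch)
            i += 1
    else:
        raise ValueError("unterminated regex option")
    i += 1                               # skip the closing slash
    j = i
    while j < len(text) and text[j] in STANDARD_MODIFIERS:
        j += 1
    return "".join(pattern), text[i:j], text[j:]

print(parse_regex_option(r"/[a-z]\/\d/ic"))  # ('[a-z]/\\d', 'ic', '')
```

This also shows why stage modifiers must not directly follow the option: in /\d/s the s would be consumed as a modifier of the regex option.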

List formatting

There are several atomic stages whose result is a list of strings, namely AntiGrep, Grep, Positions, List and Split. Whenever a stage outputs a list, it will simply join the individual strings with linefeeds (0x0A) by default. However, the options [, |, and ] can be used to specify a custom prefix, delimiter and suffix, respectively.

The syntax is similar to a string option. If the format string contains multiple characters, it is surrounded in double quotes (which can themselves be escaped by doubling them inside). If it's only a single character, a leading single quote suffices. However, the difference to string options is that the single quote can be omitted for any character except single and double quotes.

Let's look at some examples of formatting a list containing the three strings abc, def and ghi. By default, this list is returned as follows:

abc
def
ghi

We can turn this into a comma-separated list by using the option |', (or, omitting the single quote, |,):

abc,def,ghi

We can also get a list that looks more like an array in many mainstream languages by using the configuration [[|", "]], which sets the prefix to [, the delimiter to , followed by a space and the suffix to ]:

[abc, def, ghi]

You can also specify an empty string as the separator if you want the results all squished together, using |"":

abcdefghi
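In terms of ordinary string operations, list formatting amounts to the following Python sketch (format_list is a made-up name; the defaults mirror the default behaviour described above):

```python
def format_list(items, prefix="", delimiter="\n", suffix=""):
    """Join a list of strings with an optional prefix, delimiter and suffix."""
    return prefix + delimiter.join(items) + suffix

items = ["abc", "def", "ghi"]
print(format_list(items, delimiter=","))                          # abc,def,ghi
print(format_list(items, prefix="[", delimiter=", ", suffix="]")) # [abc, def, ghi]
print(format_list(items, delimiter=""))                           # abcdefghi
```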

General options

The following option is available on all stages (atomic and compound):

  • !,: Toggles whether this stage registers with the history to record its results. By default, all stages do register, but this default can itself be toggled with the global option !.. Output stages and unconfigured dry run stages never register with the history, regardless of these options.

The remaining options are available on most stages, but their exact meaning depends on the stage they're used on. See the sections on the individual stages for details.

  • ^: Reverse. This option reverses, negates or inverts some behaviour of the stage. Make sure not to put this directly in front of a limit, or it will be parsed as an inverted limit instead of a separate option.
  • ?: Random. This option randomises some part of the stage's behaviour.

Compound stages

As mentioned earlier, Retina also has compound stages which wrap around one or more other stages and provide control flow or side effects. Compound stages are specified entirely within the configuration parts of atomic stages, which complicates the syntax a bit.

We will separate compound stages into two subcategories: there are unary compound stages which contain only a single child stage, and there are group stages which contain an arbitrary number of child stages. Let's talk about unary compound stages first.

Unary compound stages

Most compound stages only have a single child stage. Compound stages are denoted by various non-alphanumeric characters. Retina currently supports the following unary compound stages:

  • %: Per-line.
  • _: Match mask.
  • >, <, \, ;: Output. All four characters introduce an output stage, but with different options set on the stage.
  • *: Dry run.
  • +: Loop.
  • &: Conditional.
  • ~: Eval.

For details on the meaning of these stages, see their individual sections below.

To wrap an atomic stage in a unary compound stage, you simply use the corresponding character inside the stage's configuration. When you do, options for the atomic stage itself need to go right of the compound stage, whereas options for the compound stage go left of it. As an example:

/\d/&Os`.

Here, & wraps the atomic stage in a conditional stage. Os are options for the atomic stage and /\d/ is an option for the conditional stage.

Compound stages can also be nested, by listing multiple of them in a single configuration. In this case the leftmost compound stage is the outermost one. In other words, the compound stage closest to the atomic stage's regex is the one that is immediately wrapped around the atomic stage. A compound stage's configuration is always immediately left of it. Another example:

/\d/*?>L`\d+

Here we have an atomic stage (List) wrapped in an output stage (>) with a random option (?), which is itself wrapped in a dry run stage (*) with a regex option (/\d/).

Group stages

This is where it gets messy. Group stages (or just groups) are a single type of compound stage which contains multiple child stages. And this is also their primary purpose: bracketing a number of stages so that they can be fed as a single unit to a unary compound stage (although group stages have some options that can make them useful in their own right).

Groups are denoted by parentheses: ( indicates the beginning of a group and ) indicates the end of a group. An important syntactic rule is that ( and ) may never appear in the same configuration. The stage containing the ( will be the first stage of the group and the stage containing the ) will be the last stage of the group.

Furthermore, these parentheses don't need to be matched: if Retina encounters an unmatched ) it assumes an implicit ( at the beginning of the first stage's configuration. If Retina encounters an unmatched (, it assumes an implicit ) at the beginning of the last stage's configuration. Let's look at an example:

A`
)C`
D`
(G`
I`
)K`
L`
((N`
)O`
(P`
))S`
((T`
)V`
(Y`

If you go through the parentheses, you'll find that there's one unmatched ) and two unmatched (. Using a somewhat more readable notation, the stages are grouped as follows:

(AC)D(GIK)L((NO)(PS))((TV)(Y))
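The grouping rules, including the implicit parentheses, can be reproduced with a small stack-based Python sketch. It only understands configurations made of parentheses plus a single stage-type letter, which is a deliberate simplification; group_stages and render are names invented for this example.

```python
def group_stages(sources):
    """Build a nested-list tree from sources like '((N`' or ')O`'."""
    root = []
    stack = [root]
    for source in sources:
        config = source.partition("`")[0]
        name = config.strip("()")            # the stage-type letter
        for _ in range(config.count("(")):   # each ( opens a new group
            group = []
            stack[-1].append(group)
            stack.append(group)
        stack[-1].append(name)               # the stage joins the innermost group
        for _ in range(config.count(")")):
            if len(stack) > 1:
                stack.pop()                  # close the innermost open group
            else:
                # Unmatched ): implicit ( before the program's first stage.
                root[:] = [list(root)]
    # Groups still open here are closed implicitly after the last stage.
    return root

def render(node):
    if isinstance(node, str):
        return node
    return "(" + "".join(render(child) for child in node) + ")"

program = ["A`", ")C`", "D`", "(G`", "I`", ")K`", "L`",
           "((N`", ")O`", "(P`", "))S`", "((T`", ")V`", "(Y`"]
print("".join(render(node) for node in group_stages(program)))
# (AC)D(GIK)L((NO)(PS))((TV)(Y))
```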

It can at first be a bit confusing that the closing ) visually appears before the group's last stage in the source code. Also, as for unary compound stages, the leftmost parenthesis in a single configuration corresponds to the outermost group.

Options for a group can go left of either parenthesis. This is to ensure that groups can be configured even if one of the two parentheses has been omitted. If there are options left of both parentheses, the options of the closing parenthesis are considered first. So the following three programs are identical:

3(G`
5)G`

5 3(G`
)G`

(G`
5 3)G`

To combine unary compound stages and groups, we still follow the same basic rules: the leftmost stage is the outermost stage in the program's tree structure, and compound stages on the opening parenthesis are considered to be "right" of compound stages on the closing parenthesis. Therefore, the following three programs are all identical:

>(G`
*)G`

*>(G`
)G`

(G`
*>)G`

In each case we have a group containing two atomic stages, wrapped in an output stage, wrapped in a dry run stage. Note that you should only look until the next parenthesis when trying to combine options and compound stages from both parentheses of a group. This is best explained with another example of identical programs:

&(>(G`
+)*)G`

+&(>*(G`
))G`

(>*(G`
+&))G`

You can think of this as unary compound stages having higher precedence than group stages.

Finally, there are the shorthand notations { and }. The most common use case for groups is to run a number of stages in a loop. Therefore, Retina provides { and } which are shorthand for +( and +), respectively. This is an important distinction from saying that {} marks a group that is wrapped in a loop. In particular, you will almost never want to use both curly braces for a single group. The following two programs are identical, so using two braces would actually wrap the group in two loop stages:

{G`
}G`

+(G`
+)G`

Furthermore, due to the way options and compound stages on the two parentheses interact, you can actually squeeze options and unary compound stages in between the loop and the group stage by introducing the loop with }. Again, the following programs are all identical:

\(G`
}G`

\(G`
+)G`

+\(G`
)G`

A note on regex modifiers

The standard regex modifiers are inherited by the child stages of compound stages. That means you can toggle a regex modifier on a group stage in order to toggle it for all of its child stages. In the following example, all three stages are run with the single-line modifier s:

s(G`.+
O`.
L`.{3}

Since standard regex modifiers are toggles you can disable them for individual child stages by listing them in their configuration again.
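The toggling behaviour can be pictured as a set that each modifier letter flips in or out, starting from whatever the parent stage passes down. A minimal Python sketch (effective_modifiers is a hypothetical name, and it only models the standard modifiers):

```python
def effective_modifiers(inherited, config_letters):
    """Apply a stage's modifier letters as toggles on top of inherited ones."""
    active = set(inherited)
    for letter in config_letters:
        active ^= {letter}   # toggle: add if absent, remove if present
    return active

group = effective_modifiers(set(), "s")   # s(G`.+ switches s on for the group
print(effective_modifiers(group, ""))     # child without modifiers -> {'s'}
print(effective_modifiers(group, "s"))    # child listing s again   -> set()
```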

How Retina programs are executed

Retina programs form a pipeline for transforming the program's input to an output. There are no variables in the traditional sense. Retina simply keeps track of a working string (initially the contents of the standard input stream), feeds this working string to a stage and then sets the working string to the stage's result. The program is kicked off by running the root of the program's stage tree (which is always an implicit group stage).

History

Retina has a concept of a result history, or history for short. Most types of stages record their result in the history, where it can be referred to from inside regex substitutions.

The stages that do not interact with the history are output stages, as these only cause a side effect (but their result will always be identical to their child stage's result), as well as dry run stages without a regex or string option, as their result will always be identical to their input. The options !. and !, can be used to control whether all other stages register themselves with the history (by default, they do).

This history consists of two parts:

  1. Stage history. For each registered stage, Retina remembers the last result produced by that stage. It's possible for any stage to provide several results over the course of a program if that stage is run in a loop. To access this last result, one needs the stage's index. Index 0 corresponds to the program's input. After that, the program's stage tree is numbered in a post-order traversal, skipping over unregistered stages. In other words, for each compound stage we first number its child stages (down to atomic stages) before assigning a number to the compound stage itself and then progressing to the next stage at the same level.

    Here is an example:

    !,+A`
    (_>C`
    D!,`
    &)*%G`
    

    The stage indices correspond to the following stages:

    0  Program input
    1  A
    2  C
    3  _
    4  G
    5  %
    6  ()
    7  &
    

    Note that + and D are skipped because of the !, option, > is skipped because it's an output stage and * is skipped because it's a dry run stage without a string or regex option.

    Technically, the implicit group stage surrounding the entire program also gets its own index (8, in this case), but it won't record a result before the program terminates and is therefore useless.

  2. Result log. Each registered stage also records its result in a running log. It's possible to refer back to the nth most recent result in this log from a regex substitution, regardless of which stage produced that result. This log initially also contains only the program's input. This log makes it possible to refer to stage results from earlier loop iterations (among other things).

    As this log can potentially grow in memory usage indefinitely, Retina contains a couple of measures to control its size:

    First, the result log is only active if at least one substitution in the program contains a $-n history element or a ${…} dynamic element. Otherwise it's impossible to access the result log, so it doesn't need to be kept track of. However, note that even a substitution element like ${a}, which refers to a named capturing group, will activate the result log.

    Second, the global option !#n can be used to limit the log in size. With this option, only the last n results will be remembered. Using !#0 (or the equivalent !#) disables the log completely.

How exactly the history can be accessed is explained in the section on Retina's substitution syntax.
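The stage-history numbering described above (post-order, skipping unregistered stages) can be sketched in Python as follows. The Stage class and number_stages function are invented for this illustration, and the tree is built by hand from the example program rather than parsed from it.

```python
from dataclasses import dataclass, field

@dataclass
class Stage:
    name: str
    registers: bool = True            # False for !, / output / plain dry run
    children: list = field(default_factory=list)

def number_stages(root):
    """Return stage names by history index; index 0 is the program's input."""
    numbered = ["Program input"]
    def visit(stage):
        for child in stage.children:  # children first (post-order)
            visit(child)
        if stage.registers:
            numbered.append(stage.name)
    visit(root)
    return numbered

# Hand-built tree for the four-source example program above.
program = Stage("implicit group", children=[
    Stage("+", registers=False, children=[Stage("A")]),
    Stage("&", children=[
        Stage("()", children=[
            Stage("_", children=[
                Stage(">", registers=False, children=[Stage("C")]),
            ]),
            Stage("D", registers=False),
            Stage("*", registers=False, children=[
                Stage("%", children=[Stage("G")]),
            ]),
        ]),
    ]),
])

for index, name in enumerate(number_stages(program)):
    print(index, name)
```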

Executing an atomic stage

All atomic stages have a common set of preprocessing steps to perform the basic matches and substitutions and handle any options which are common to all stages. The individual stage types are then defined in terms of the results of this preprocessing.

The gist of this section is that the atomic stage first finds the list of matches (generally all non-overlapping matches of the regex in the input, but various options can change that), then computes the separators (i.e. the segments of string between the matches that weren't matched) and also the list of substitutions (by applying the substitution pattern to each match). These three lists (matches, separators, substitutions) as well as the working string are then handed off to the specific stage type to work with.

This section describes how Retina arrives at these three lists in detail, but can safely be skipped or skimmed unless you're using one of the advanced options like the match behaviour options. One thing that is worth noting though is that the first limit of each atomic stage is used to filter the list of matches.

Retina computes the three lists in a process of up to eight steps:

  1. (Optional) Swap working string and regex source.
  2. (Optional) Anchor the regex(es).
  3. Get all matches.
  4. (Optional) Filter the matches.
  5. (Optional) Reverse the matches for RTL stages.
  6. Compute separators.
  7. (Optional) Invert matches.
  8. Compute substitutions.

Steps 3 and 8 are the meat of this process. Step 8 might seem optional because not every stage has a substitution, but if no explicit substitution is given it defaults to $&, which computes substitutions as the identity function.

The following sections go through these eight steps in detail.

(Optional) Swap working string and regex source

Normally, a stage's first source (minus the configuration) is used as the source of its regex, and the current working string is used as its input.

If the input-as-regex option : is used, these roles are swapped. The working string will be used as the stage's regex and the stage's first source will be used as its input. This lets you run a dynamic regex against a fixed input.

(Optional) Anchor the regex(es)

If the custom regex modifier a or l is used, the stage's regex(es) are all anchored to the input string or each line, respectively. This is done by wrapping the regex's source in \A(?:…)\z or (?m:^)(?:…)(?m:$), respectively.

Get all matches

This is where it can get complicated. If there are no special options given, this step simply finds all the non-overlapping matches of the regex in the stage's input, much like calling the .NET method Matches. However, various options can affect and complicate this process.

Right-to-left stages

If a stage uses the standard regex modifier r for right-to-left mode (RTL), the matches are also obtained in RTL order, i.e. starting at the end of the string. This is relevant for several other options as well as the next couple of steps.

Unique matches

The custom regex modifiers p and q admit only unique matches. If there are multiple matches with the same string value and one of these options is used, only one of those matches will be kept. In the case of the q modifier, only the first such match will be kept. In the case of the p modifier, only the last such match will be kept.

Here, "first" and "last" refer to the search order. For example, when using q in conjunction with r, the rightmost match will be kept (as it will be encountered first during an RTL stage).
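The interaction between uniqueness and search order can be sketched in Python as follows. This is only a model: matches are represented as hypothetical `(position, text)` pairs that are already listed in search order, and the function name `unique_matches` and its `keep` parameter are made up for the example.

```python
def unique_matches(matches, keep="first"):
    """Keep one match per distinct text; `matches` must be in search order."""
    ordered = matches if keep == "first" else list(reversed(matches))
    seen = set()
    kept = []
    for position, text in ordered:
        if text not in seen:
            seen.add(text)
            kept.append((position, text))
    # Restore the original search order of the surviving matches.
    return kept if keep == "first" else list(reversed(kept))


# Matching . against "abab": left-to-right and right-to-left search order.
ltr = [(0, "a"), (1, "b"), (2, "a"), (3, "b")]
rtl = list(reversed(ltr))

q_ltr = unique_matches(ltr, keep="first")  # models q
p_ltr = unique_matches(ltr, keep="last")   # models p
q_rtl = unique_matches(rtl, keep="first")  # models q together with r
```

In the last case the rightmost `a` and `b` survive, because they are encountered first when searching from the right.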

Overlaps

Normally, matches of a regex are not allowed to overlap (although they may be directly adjacent). The custom regex modifiers v and w change this behaviour.

  • Simple overlaps: When the v option is used, Retina tries to find one match from every possible starting position. It is still not possible to get multiple matches from the same starting position.

    Note that for RTL stages a match's starting position is its right end. Therefore, matching .+ against a string with the v option gives all suffixes of the input, but if the r option is also used, it gives all prefixes instead.

  • All overlaps: When the w option is used, Retina actually finds all substrings of the input which match the given regex.

    This cannot be done without modifying the regex in a way that isn't free of side-effects. In particular, it introduces a named group called _suffix, which you should avoid inside the regex. If the r option is not used then it also shifts the group numbers of named groups inside the regex up by one. So the group number of group a in (.)(?<a>.) would become 3 instead of 2.

    The search order of overlapping matches goes through all possible substrings first by starting position (remember that this is the right end for RTL stages), and second by length (in increasing order). So matching .+ with the w option against ab gives a, ab, b. If the r option is also used, we'd get b, ab, a instead.

    Warning: Using this option can make a stage significantly slower, depending on the regex and the input, because the regex engine will have to backtrack a lot more to reach the prescribed length of each match.
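The search orders described above can be reproduced with a brute-force Python sketch. It is not how Retina works internally (Retina rewrites the regex, as noted above), and it uses Python's `re` module rather than .NET's engine. The function names are hypothetical, RTL is modelled only for the all-overlaps case, and restricting a match to a substring via `fullmatch` means that lookaheads cannot see past the end of the candidate substring.

```python
import re


def simple_overlaps(source, text):
    """Model of v (LTR only): at most one match per starting position."""
    regex = re.compile(source)
    found = []
    for start in range(len(text) + 1):
        m = regex.match(text, start)
        if m:
            found.append(m.group())
    return found


def all_overlaps(source, text, rtl=False):
    """Model of w: every matching substring, by start position, then length."""
    regex = re.compile(source)
    n = len(text)
    found = []
    if not rtl:
        for start in range(n + 1):
            for end in range(start, n + 1):
                if regex.fullmatch(text, start, end):
                    found.append(text[start:end])
    else:
        # For RTL the "starting position" is the right end of the match.
        for end in range(n, -1, -1):
            for start in range(end, -1, -1):
                if regex.fullmatch(text, start, end):
                    found.append(text[start:end])
    return found
```

For example, `simple_overlaps(".+", "abc")` yields the suffixes of the input, and `all_overlaps(".+", "ab")` yields `a`, `ab`, `b`, in line with the orders given above.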

Multi-regex stages

A stage can be given multiple regex sources (and substitutions) when the #n option is used. When n is positive, the regexes will be used cyclically. When n is negative, the regexes will be used greedily.

  • Cyclic: Using the regexes cyclically means that the first match will be made with the first regex, the second match with the second regex, and so on until we reach the nth match/regex. After that, we start over using the first regex, then the second again and so on. A common use case for this is to have two regexes, one for the elements of a list, and one for the delimiters between elements. The stage will then match both elements and delimiters by alternating between the two regexes. The benefit of doing this with two regexes instead of a single one (using an alternation), is that this lets you specify different substitutions for the different regexes.
  • Greedy: Using the regexes greedily means that Retina will try to find a match for each regex in the stage and then pick the earliest match for the current search order (usually the leftmost match, but it will be the rightmost match for RTL stages, and it might be the shorter of two matches at the same position when the w option is used). If multiple regexes find a match at the same position, Retina will pick the regex (and corresponding match) which is listed first in the stage's sources.
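The cyclic mode can be approximated in Python as shown below, using the element/delimiter use case. This sketch rests on assumptions that the text above does not confirm: that each search resumes where the previous match ended, that matching stops as soon as the regex whose turn it is finds nothing, and that an empty match advances the position by one character. The function name `cyclic_matches` is hypothetical.

```python
import re


def cyclic_matches(text, sources):
    """Return (regex index, matched text) pairs, alternating between regexes."""
    regexes = [re.compile(source) for source in sources]
    result = []
    position = 0
    turn = 0
    while position <= len(text):
        index = turn % len(regexes)
        m = regexes[index].search(text, position)
        if m is None:
            break  # assumption: stop when the current regex fails
        result.append((index, m.group()))
        # Assumption: step past empty matches to guarantee progress.
        position = m.end() if m.end() > m.start() else m.end() + 1
        turn += 1
    return result


pairs = cyclic_matches("1,2,3", [r"\d+", r","])
```

Because each match records which regex produced it, a separate substitution could be chosen per regex, which is the benefit over a single alternation.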

(Optional) Filter the matches

Two options can be used to filter the matches:

  1. The first limit in any atomic stage is used to filter the list of matches. Note that the list is still in search order, so on an RTL stage the limit 0 would select the rightmost match, and the limit -2, would select the two leftmost matches.
  2. Afterwards, if the @ option is used, of the remaining matches one is picked uniformly at random (and the match list will then contain only that match). If the match list is empty, it will simply remain empty, of course.

(Optional) Reverse the matches for RTL stages

If the stage used the RTL option r, the list of matches will now be reversed. This ensures that for a regular match (without overlap options), the resulting list will still be in the regular (LTR) string order. This is what you'd expect from running such a regex through .NET's Matches method. Of course, for stages with overlapping matches, the resulting match list isn't necessarily sorted by left-hand position.

Compute separators

Now the separators are computed, which are the (possibly empty) strings between matches, as well as the part before the first and after the last match. There will always be one more separator than the number of matches, so that by alternating between separator and match we can cover the entire input.

As a simple example, consider matching the input Hello, World! with the regex \w+. This yields the matches "Hello" and "World" and the separators "" (in front of the first match), ", " and "!".

This gets a bit tricky with overlapping matches. While this situation will be irrelevant in 99% of use cases, the behaviour is specified anyway. If two adjacent matches overlap, an empty separator is still inserted between them. Also, only adjacent matches are considered for computing separators, which means that the set of characters covered by the matches isn't necessarily disjoint from the set of characters covered by the separators. Consider matching the input abcdcba with the regex bcd|c using one of the overlap options v or w. We would obtain the matches bcd, c and another c. The separators would be a, empty (between bcd and c), d (between the two cs), ba. Note that the d is part of the separator even though it's covered by the earlier bcd match.

Retina will actually back these separators by matches with automatically generated regexes, which can be important for the next step and for certain substitutions. These regexes will never contain any groups and simply match a string of the required length at the correct position. When matches overlap, the (empty) separator will be at the left end of the second match.
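The separator rules, including the overlap case, can be checked with a small Python sketch. It is a model under stated assumptions rather than Retina's code: matches are given as hypothetical `(start, end)` spans in list order, with the first span treated as the leftmost one and the last span as the rightmost one.

```python
def compute_separators(text, spans):
    """Return the separators around and between adjacent (start, end) spans."""
    if not spans:
        return [text]  # no matches: the whole input is a single separator
    separators = [text[:spans[0][0]]]
    for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
        if prev_end <= next_start:
            separators.append(text[prev_end:next_start])
        else:
            separators.append("")  # adjacent matches overlap
    separators.append(text[spans[-1][1]:])
    return separators


# \w+ against "Hello, World!"
simple = compute_separators("Hello, World!", [(0, 5), (7, 12)])
# bcd|c against "abcdcba" with overlaps: bcd, c, c
overlapping = compute_separators("abcdcba", [(1, 4), (2, 3), (4, 5)])
```

The second call reproduces the separators `a`, empty, `d`, `ba` from the example above, including the `d` that is also covered by the earlier `bcd` match.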

(Optional) Invert matches

If the invert-matches option = is used, Retina will now swap the roles of the matches and separators. That is, for the remainder of the current stage, the separator list is treated as the match list and vice-versa. To ensure that we still have one more separator than matches, the original match list (i.e. the new separator list) is augmented with an empty separator at the beginning and end of the string.

An example should help. Consider again the regex \w+ and the input Hello, World!. By default we get the following match and separator list (spaced for clarity):

Matches:       "Hello"      "World"
Separators: ""         ", "         "!"

If we now use the = option, the new match and separator lists become:

Matches:       ""         ", "         "!"
Separators: ""    "Hello"      "World"     ""
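In terms of plain lists, the swap amounts to the following Python sketch (the function name `invert` is hypothetical, and the lists hold only the strings, not the positions that Retina also tracks):

```python
def invert(matches, separators):
    """Swap roles: separators become matches, padded matches become separators."""
    new_matches = list(separators)
    new_separators = [""] + list(matches) + [""]
    return new_matches, new_separators


new_matches, new_separators = invert(["Hello", "World"], ["", ", ", "!"])
```

The two added empty strings keep the invariant that there is exactly one more separator than there are matches.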

Compute substitutions

If the $ option was used on the stage (or the stage is a Replace stage), the program will contain one substitution for each regex. The only exception to this is if the invert-matches option = was used (in which case there will be only one substitution, regardless of the number of regexes). If the $ option was not used, all of these substitutions default to $&, the identity substitution.

The reason for using only a single substitution with the = option is that it's no longer possible to associate a unique regex with each match.

This step now goes through all the matches and applies the appropriate substitution pattern to each of them. This way, we obtain a list of substitutions with the same length as the list of matches. The separators are left untouched.

Substitutions work similarly to .NET's Replace method, but Retina has a few small incompatibilities and in return adds a whole host of new features. For details see the section on substitution syntax.

This concludes the preprocessing steps of atomic stages.

Substitution syntax

Retina's syntax for substitution patterns is based on the syntax used by .NET's Replace method and various other regex flavours. So you can insert a match with $& or a capturing group with $1 for example. However, Retina also provides a number of advanced operators inside substitutions as well as several other basic substitution elements. This comes at the cost of sacrificing compatibility with .NET's syntax in a few places. This section first goes over these differences and then explains its own syntax in detail.

Differences from .NET's syntax

This section only lists features you might expect coming from .NET, which aren't present or are different in Retina. Of course, all the added features are also differences, but those will be explained below.

  • In .NET, $_ inserts the entire input. Retina uses $= instead (for reasons that will become apparent later). $_ is just inserted literally in Retina.
  • In .NET, $+ inserts the capturing group with the largest group number. Retina does not provide this feature at all, and $+ has a different, unrelated function.
  • In .NET, all substitution syntax elements start with $, so every other character can be used without having to escape it. In Retina, * is also a special character that needs to be escaped to insert it literally. ) and } can normally be used without escaping them, unless you use their counterparts $( and ${, in which case ) and } need to be escaped as well.
  • In .NET, when referring to a group by number or name which doesn't exist in the regex, the reference to the group is inserted into the result literally. E.g. replacing (\d+) with $2 just replaces each number with $2 (literally). In Retina, references to unused groups are dropped from the result, so the same substitution would remove all numbers from the string.

Basics

Retina's substitution syntax is actually a little mini language (whose parser is even higher quality than Retina's main parser itself...) made up of expressions in the form of elements and operators. Each expression generates some string for the output, based on the match, the input string, adjacent matches and separators as well as the program's history. Certain string transformations can be performed directly inside the substitution using unary and binary operators.

An expression can be one of the following:

  • An escape sequence like $$ for a literal dollar sign or $* for a literal asterisk.
  • A shorthand element. These are mostly the usual elements you are probably used to like $&, $1, $' etc. but also several others.
  • A dynamic element ${…}, where … can be a concatenation of expressions.
  • A grouping $(…), where … can be a concatenation of expressions. This really just acts like parentheses in other languages and is used to feed a concatenation of elements as a single operand to an operator (because concatenation itself has the lowest precedence of any operator). Groupings also have the alternative form $.(…), which computes and produces the length of … instead.
  • A unary (prefix) operator, like $l or $^.
  • The binary operator *. It's right-associative, both operands are optional and it has a higher precedence than the unary operators.
  • A positive integer. These are just literals, but it's important that operators treat multi-digit integers as a single expression instead of just applying to their first or last digit.
  • Any other single character. This is then also just a literal.

A full substitution pattern is then just a concatenation of expressions. The details of the various types of expressions are explained in the following sections.

Literals and escape sequences

Any character that doesn't form another valid type of expression is simply a literal and just generates itself as output. This includes dollar signs which are not followed by the correct syntax for any other type of expression. E.g. $X is a concatenation of two literals $ and X.

As mentioned above, strings of digits are treated as a single literal expression. When used at the top level of a substitution pattern, this distinction is irrelevant, but it means that integers can be fed to operators without having to group the digits.

Retina provides the following escape sequences for characters that otherwise have a special meaning or can't be inserted easily:

  • $$ for $.
  • $* for *.
  • $) for ).
  • $} for }.
  • $n for a linefeed (0x0A), although this can also be entered as ¶ in the program's source code.
  • $¶ (which really is a dollar sign followed by a linefeed) for an actual pilcrow (0xB6).

Shorthand elements

Shorthand elements are the primary way to generate information from the match, or its surrounding context, or the program's history. They include (almost) all the special substitution elements from .NET's syntax but also provide several other features.

There are several types of shorthand elements:

  • Full match: Both $& and $0 generate the entire match.

  • Numbered group: $n, where n is a positive integer, generates the last string captured by group n. If the regex doesn't contain a group n or the group didn't capture anything, this generates an empty string.

  • Context: $= generates the current stage's entire input string (this is $_ in .NET's substitution syntax). $` generates the match's prefix, i.e. everything in the input up to the point where the match starts. $' generates the match's suffix, i.e. everything in the input after the point where the match ends.

    Retina also provides $" as a shorthand for $'¶$`. This may seem odd, but it lets you generate two or more possible substitutions at once (separated by linefeeds). Consider matching c in abcde and replacing it with <$">. The result is:

    ab<de
    ab>de
    
  • History: $+n, where n is a non-negative integer, generates the latest result from stage n. $+ is equivalent to $+0 and generates the program's input. $-n, where n is a non-negative integer, generates the nth most recent stage result. $- is equivalent to $-0 and generates the result of the previous stage. See the history section for details.
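To see why the $" example in the list above produces two lines, it helps to expand the context elements by hand. The Python sketch below does this for a single match. The function name `substitute_quote` is hypothetical, and the match is passed as explicit start and end offsets.

```python
def substitute_quote(text, start, end):
    """Replace text[start:end] with <$">, where $" is suffix, linefeed, prefix."""
    prefix = text[:start]          # corresponds to $`
    suffix = text[end:]            # corresponds to $'
    quote = suffix + "\n" + prefix  # corresponds to $"
    # The replacement is reinserted between the untouched prefix and suffix.
    return prefix + "<" + quote + ">" + suffix


result = substitute_quote("abcde", 2, 3)  # the match is "c"
```

The prefix before the linefeed and the suffix after it complete the two lines, so each line is a full copy of the input with the match replaced by `<` or `>` respectively.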

It is also possible to insert certain modifiers into these elements. The modifiers always go after the $. Not all modifiers are available on all types of elements, but if multiple modifiers are used, they need to be given in the order listed below:

  • Length: .. This can be used with all types of elements. Generates the length of the string that would be generated by the same element without the .. E.g. $.& gives the length of the match and $.+ gives the length of the program's input.

  • Adjacent: Any one of <>[]. This can be used with full match elements, numbered groups and context elements. $<x generates $x in the context of the separator left of the current match. $>x does the same for the separator right of the current match. $[x generates $x in the context of the match left of the current match (and analogously for $]x).

    Normally, $[x will generate an empty string if the current match is already the leftmost match. However, if the current stage uses the custom regex modifier y the matches are treated cyclically instead, so that $[x on the leftmost match would refer to $x in the context of the rightmost match and vice versa for $]x.

  • Generic modifiers: Any one of #:;?. These can be used with full match elements or (in the case of # and ?) numbered groups.

    $#& or $#0 generates the number of groups with successful matches. $#n generates the number of captures performed by group n.

    $:& or $:0 generates the match's 0-based index from the left (regardless of whether the regex was matched LTR or RTL). If this gets shifted to a separator with < or >, it generates the separator's index instead. In particular, $>:& effectively generates the 1-based index of the match.

    $;& or $;0 generates the match's 0-based index from the right (regardless of whether the regex was matched LTR or RTL). If this gets shifted to a separator with < or >, it generates the separator's index instead. In particular, $<;& effectively generates the 1-based index of the match.

    $?& or $?0 generates the last value captured by a uniformly random group. This may pick a group that didn't have any captures at all, in which case the result is an empty string. $?n generates a random capture from group n.

  • Line only: %. This can be used with context elements. If it is used, the generated string stops at the nearest linefeed (searching from the match). In other words, $%= generates the line the match is on, $%` generates the match's prefix on its line, $%' generates the suffix on its line.

    You can also use $%" as a shorthand for $%'¶$%`. Note that $%= may contain linefeeds if the match itself contains linefeeds (because it's actually equivalent to $%`$&$%').

Dynamic elements

If you're familiar with .NET's substitution syntax, you might have seen the syntax ${abc} to refer to the result of a named group (?<abc>…). Retina also provides this syntax but makes it a lot more powerful: the contents of a ${…} can be an arbitrary expression, which is evaluated before looking up what the entire ${…} refers to. That's why these are called dynamic elements in Retina. For example, they allow you to choose between two different groups based on a third one.

After evaluating the child expression of a dynamic element, there are three possible cases:

  • The result is a string of the form [_a-zA-Z]\w*, optionally preceded by any of the modifiers which are allowed for numbered groups. If a group of the same name exists in the match we're currently replacing, then this dynamic element generates the group's captured value. Otherwise, it generates an empty string.
  • The result is equal to any of the shorthand elements (minus the leading dollar sign which introduces the shorthand). In this case, this dynamic element generates the value of that shorthand element. For example ${.#2} (or any dynamic element that evaluates to the same string) is equivalent to $.#2.
  • In any other case, this dynamic element just generates an empty string.

This means that there are only two cases where you'd ever use the ${…} syntax: either you actually need its dynamic nature and want to select one of several different elements dynamically, or you want to refer to a named group (because there are no shorthand elements for named groups). There's no reason to use a dynamic element with static contents if those contents are not a named group, because every other possible element has a shorthand form.

This relationship between shorthand and dynamic elements is the reason that Retina does not support .NET's $_ (and uses $= instead), because it would be ambiguous whether ${_} refers to $_ or the named group (?<_>…) (which is a valid group name).

Groupings and length computation

We'll get to operators in the next section, but they'll always just take the nearest full expression as an operand, because concatenation has a lower precedence than any of the operators. In order to concatenate expressions and use them as a single operand, they can be grouped with $(…).

However, groupings have a second function: by using the alternative form $.(…), Retina will instead generate the length of the result of the expression …. This length computation is in fact lazy and whenever possible the full result of … is never generated. For example, jumping slightly ahead to the repetition operator, $.(500*$&) won't actually generate a string containing 500 copies of the match. Instead, it will just determine the length of $& and then know that the length of 500*$& is just 500 times that length. This makes $.(…) quite efficient and even allows the insertion of some string lengths where the string itself could never be held in memory at once. This length computation uses arbitrary-precision integers.

Unary operators

Retina provides a number of unary operators, which are simply written in front of the expression whose result they should transform. Remember that these have a higher precedence than concatenation, so operands consisting of more than one expression need to be grouped using $(…).

  • $^: Reverses its operand.
  • $\: Escapes its operand for use as a literal inside a regex. This is done by using .NET's Regex.Escape method and then also replacing all / with \/. This is useful for generating code dynamically to be used with the : option or eval stages.
  • $l: Converts the first character of its operand to lower case.
  • $L: Converts its operand to lower case.
  • $u: Converts the first character of its operand to upper case.
  • $U: Converts its operand to upper case.
  • $T: Converts its operand to title case, by finding all runs of letters (characters matched by \p{L}) and converting the first one in each to upper case and the remainder to lower case.
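For readers who think in terms of string functions, the case operators and the reversal roughly correspond to the Python sketch below. The function names are hypothetical, and the letter class `[^\W\d_]` is only an approximation of \p{L}, since Python's `re` module does not support Unicode property classes. $\ is left out because it depends on .NET's Regex.Escape.

```python
import re


def reverse(s):
    """Rough model of $^."""
    return s[::-1]


def lower_first(s):
    """Rough model of $l: lower-case only the first character."""
    return s[:1].lower() + s[1:]


def upper_first(s):
    """Rough model of $u: upper-case only the first character."""
    return s[:1].upper() + s[1:]


def title_case(s):
    """Rough model of $T: capitalise each run of letters."""
    def fix_run(m):
        run = m.group()
        return run[:1].upper() + run[1:].lower()
    # [^\W\d_] approximates "any letter" (an assumption, not \p{L} itself).
    return re.sub(r"[^\W\d_]+", fix_run, s)
```

Note that `title_case` treats any non-letter as a boundary between runs, so `wORLD-foo` becomes `World-Foo`.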

Repetition

Retina also has one binary operator, *, which is used to repeat a character or string. The right-hand operand is the string to be repeated and the left-hand operand determines the number of repetitions.

Since the left-hand operand is in general an arbitrary string it first needs to be converted to an integer. This is done by finding the first match of \d+ and interpreting it as a decimal integer. If the left-hand operand contains no digits, the value is treated as 0.

The repetition operator is right-associative, so chaining multiple repetitions multiplies their repetition counts: 3*5*x generates a string of 15 xs.

Both operands of * are optional. The left-hand operand defaults to $& (the entire match) and the right-hand operand defaults to _ (a literal underscore). In line with the operator's associativity, an implicit operand between two * is assumed to be a left-hand operand, so ** is short for $&*$&*_.

The repetition operator has a higher precedence than any unary operator, because there is usually no point in applying a unary operator to the left-hand operand of a repetition.
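The conversion of the left-hand operand and the effect of right-associativity can be sketched in Python. This is only a model of the rules described above. The function name `repeat` is hypothetical, and since there is no match available in this sketch, only the default for the right-hand operand is modelled.

```python
import re


def repeat(left, right="_"):
    """Model of left*right: repeat `right` by the first integer in `left`."""
    digits = re.search(r"\d+", left)
    count = int(digits.group()) if digits else 0  # no digits means 0
    return right * count


# Right-associativity: 3*5*x is evaluated as 3*(5*x).
chained = repeat("3", repeat("5", "x"))
```

Because the inner repetition is evaluated first, its result `xxxxx` contains no digits problem for the outer one: it is the right-hand operand there, and only the left-hand operand `3` is converted to a number.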

A note on unmatched brackets

A small fact that has been glossed over above is that both groupings $(…) and dynamic elements ${…} don't require the closing bracket. If the parser reaches either the end of the substitution pattern, or a closing bracket of the opposite type, while parsing one of these two expressions, it will assume an implicit closing bracket and automatically end the current expression. The following substitution patterns are therefore all equivalent:

$(a${b$.(c)})
$(a${b$.(c)}
$(a${b$.(c)
$(a${b$.(c
$(a${b$.(c}
$(a${b$.(c})
$(a${b$.(c))

The opening bracket cannot be omitted though. If the parser encounters a ) or }, but no expression of that type is currently being parsed, the bracket will be treated as a literal. Therefore, the closing brackets do not need to be escaped unless they are being used inside an expression of the corresponding type.

Stage type reference

This section specifies the exact semantics of every possible stage type, along with the available options.

Constant stages, K

As mentioned before, constant stages are technically atomic stages, but don't share many of the characteristics of all other atomic stages, so they're listed here separately.

The first source of a constant stage doesn't specify a regex but simply a constant literal string (after its configuration). By default, a constant stage simply discards its input and returns this constant string.

The following options are available to modify this behaviour:

  • Regex or string option: If a regex option is given, the input is only replaced by the constant if the regex matches the input. Otherwise, the input is returned. If a string option is given, the input is only replaced by the constant if it contains this string. If both are given, the regex takes precedence.
  • ^ (reverse): Only used in conjunction with a regex or string option. Negates the logic of the conditional. That is, the input is only replaced by the constant if the regex doesn't match the input or the input doesn't contain the string.
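The conditional logic of a constant stage is compact enough to write out. The Python sketch below is a model only: the function and parameter names are hypothetical, and the regex is evaluated with Python's `re` module rather than .NET's engine.

```python
import re


def constant_stage(text, constant, regex=None, string=None, reverse=False):
    """Return `constant` or the unchanged input, depending on the options."""
    if regex is not None:
        condition = re.search(regex, text) is not None  # regex takes precedence
    elif string is not None:
        condition = string in text
    else:
        return constant  # no condition: always replace the input
    if reverse:
        condition = not condition
    return constant if condition else text
```

For example, `constant_stage("abc", "X", regex=r"\d")` returns the input unchanged, while adding `reverse=True` returns `X`.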

Atomic stages

Remember that all atomic stages have a common preprocess, which results in three lists: matches, separators, and substitutions. All atomic stages are defined in terms of these three lists and the stage's input. This also means that all atomic stages support the match behaviour options, the regex modifiers and use their first limit to filter the match list. Any additional options and limits are specified here.

Some atomic stages have a default regex: when the stage uses only a single regex (so no #n option) and doesn't use the invert-matches option =, and the source of that single regex is empty, the regex defaults to a different source. This is usually the case when an empty regex would result in a stage which does nothing.

Also, some of the atomic stage specifications may seem unnecessarily convoluted, but this is often to ensure that the stage still provides useful results when overlapping matches are used.

AntiGrep, A

This stage is used to discard lines from the input which contain matches.

The stage first splits its input into lines (usually by splitting around linefeed characters, but this can be configured). It will then mark all lines which intersect with at least one match. This intersection may be an empty string: if a match begins at the end of a line (but doesn't include the last character of the line), the line is still discarded.

Next, the marked lines are discarded. Finally, the remaining lines are returned as a list.

The following options are available:

  • Limit 2: The second limit is used on the marked lines, i.e. it filters the matched lines. Lines which aren't covered by the limit will be retained.
  • ? (random): Only one of the marked lines will be chosen (uniformly) at random and discarded. This happens after applying the second limit if one is given.
  • ^ (reverse): The final list of remaining lines is reversed before being returned.
  • Regex or string option: If a regex or string option is given, this regex or string is used to split the input into lines (instead of using the default of splitting around linefeed characters).
  • Supports list formatting options.

Note that there is currently no point in using the $ option with this stage.

Count, C

This is the default stage type if there is only a single source left for the final stage.

The stage simply returns the number of matches.

The only available options are those that are common to all atomic stages. Note that there is currently no point in using the $ option with this stage.

Deduplicate, D

This stage removes duplicate matches from the input.

First, the matches are grouped by the results of their corresponding substitutions (remember that the substitutions will be equal to the matches unless the $ option is used). From each group, the first match is kept. For all further matches, each character in the input which is covered by the match is marked. Finally, all marked characters are dropped from the input and the result is returned.

Deduplicate stages have the following default regex, which matches the individual lines of the input:

(?m:^.*$)

The following options are available:

  • Limit 2: The second limit filters which characters in each match will be marked for deletion.
  • ? (random): Instead of keeping the first match in each group, a random match will be kept.
  • ^ (reverse): Instead of keeping the first match in each group, the last match will be kept. ? takes precedence over ^.
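The "mark characters, then drop them" formulation can be made concrete with a short Python sketch. It models only the default behaviour (no $ option, no second limit, first match of each group kept) and uses Python's `re` module. The function name `deduplicate` is hypothetical.

```python
import re


def deduplicate(text, source=r"(?m:^.*$)"):
    """Drop the characters of every match whose text has been seen before."""
    seen = set()
    marked = set()
    for m in re.finditer(source, text):
        if m.group() in seen:
            marked.update(range(m.start(), m.end()))  # mark for deletion
        else:
            seen.add(m.group())
    return "".join(ch for i, ch in enumerate(text) if i not in marked)


lines = deduplicate("a\nb\na\nc")
words = deduplicate("x y x z", r"\w+")
```

With the default regex, only the characters of a duplicate line are removed. The linefeeds are not part of any match, so an empty line remains where the duplicate was.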

Grep, G

This stage is used to find lines from the input which contain matches.

The stage first splits its input into lines (usually by splitting around linefeed characters, but this can be configured). It will then mark all lines which intersect with at least one match. This intersection may be an empty string: if a match begins at the end of a line (but doesn't include the last character of the line), the line is still kept.

Next, all unmarked lines are discarded. Finally, the remaining lines are returned as a list.

The following options are available:

  • Limit 2: The second limit is used on the marked lines, i.e. it filters the matched lines.
  • ? (random): Only one of the marked lines will be chosen (uniformly) at random and kept. This happens after applying the second limit if one is given.
  • ^ (reverse): The final list of remaining lines is reversed before being returned.
  • Regex or string option: If a regex or string option is given, this regex or string is used to split the input into lines (instead of using the default of splitting around linefeed characters).
  • Supports list formatting options.

Note that there is currently no point in using the $ option with this stage.
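The line-marking rule shared by Grep and AntiGrep can be sketched in Python as an interval intersection test. This is a model, not Retina's implementation. It splits on linefeeds only, ignores the second limit and all other options, and uses hypothetical function names. The text above only confirms that a match beginning at the end of a line marks that line. Treating a match that ends exactly at the start of a line the same way is an assumption of this sketch.

```python
import re


def line_spans(text):
    """Return (start, end) offsets of each line, excluding the linefeeds."""
    spans = []
    start = 0
    for line in text.split("\n"):
        spans.append((start, start + len(line)))
        start += len(line) + 1
    return spans


def grep(source, text, invert=False):
    """Keep marked lines (Grep) or, with invert=True, unmarked ones (AntiGrep)."""
    matches = [m.span() for m in re.finditer(source, text)]
    kept = []
    for start, end in line_spans(text):
        # A possibly empty intersection is enough to mark the line.
        marked = any(s <= end and e >= start for s, e in matches)
        if marked != invert:
            kept.append(text[start:end])
    return kept
```

Calling `grep("ba.", "foo\nbar\nbaz")` keeps `bar` and `baz`, and passing `invert=True` keeps only `foo`.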

List, L

This stage simply returns all substitutions as a list (remember that the substitutions will be equal to the matches unless the $ option is used).

The following options are available:

  • Limit 2: The second limit filters the characters within each match.
  • ^ (reverse): Reverses the list of matches before returning it.
  • Supports list formatting options.

Pad, P

This stage is used to pad all matches to the same length.

First, the padding width, w, is determined as the maximum of the lengths of the substitutions (remember that the substitutions will be equal to the matches unless the $ option is used). Then, every match which is shorter than w characters gets padded on the right with the padding string to a width of w. Finally, the matches are reinserted between the separators and the result is returned.

By default, the padding string is a single space.

Note that this stage expands overlapping matches.

The following options are available:

  • String option: A string option can be used to change the padding string. If the padding string consists of multiple characters, these characters will be repeated cyclically.
  • ^ (reverse): The matches are padded on the left instead. Accordingly, multi-character padding strings will be extended cyclically to the left.
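
As a rough illustration of the padding rule, here is a Python sketch (not Retina code). The name `pad_all` is hypothetical, and the handling of left padding reflects one reading of "extended cyclically to the left", namely that the padding string is repeated so that its end sits right next to the match.

```python
def pad_all(items, padding=" ", left=False):
    """Pad every string in `items` to the length of the longest one."""
    width = max((len(s) for s in items), default=0)
    out = []
    for s in items:
        need = width - len(s)
        # Repeat the padding string often enough to cover the gap.
        reps = padding * (need // len(padding) + 1) if padding else ""
        if left:
            # Keep the tail of the repetition so the cycle ends at the item.
            out.append(reps[len(reps) - need:] + s)
        else:
            # Keep the head of the repetition so the cycle starts at the item.
            out.append(s + reps[:need])
    return out
```

With a padding string of `xy`, the item `a` is padded to width 4 as `axyx` on the right and as `yxya` on the left.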

Position, I

This stage determines the (0-based) position of each match in the input string and returns the resulting positions as a list.

The following options are available:

  • ^ (reverse): Instead of determining the position where each match begins, this determines the position where each match ends.

Note that there is currently no point in using the $ option with this stage.

Replace, R

This is the default stage type if there are at least two sources left for the current stage.

This stage reinserts the substitutions between the separators and returns the result.

Note that this stage expands overlapping matches.

The following options are available:

  • Limit 2: The second limit filters the characters in each substitution.
  • ^ (reverse): The list of substitutions is reversed before being reinserted into the separators.

Note that the $ option is always active with this stage.

Reverse, V

This stage reverses the characters in each substitution and then reinserts the resulting strings into the separators (and returns the result).

Note that this stage expands overlapping matches.

Reverse stages have the following default regex, which matches the individual lines of the input:

(?m:^.*$)

The following options are available:

  • Limit 2: The second limit determines which characters should be reversed in each match. Specifically, the stage marks all characters covered by the limit, and then reverses only the marked characters. Unmarked characters remain in their original position.
  • ? (random): Instead of reversing the characters in each match, they are shuffled (uniformly). If a second limit is used, only the characters covered by the limit will be shuffled.
  • ^ (reverse): The list of (reversed) substitutions is itself reversed before being reinserted into the separators.
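
The effect of a second limit on a Reverse stage can be pictured with the following Python sketch. It is an illustration only: `reverse_marked` is a made-up name, and the set of marked indices is passed in directly instead of being derived from a Retina limit.

```python
def reverse_marked(s, marked):
    """Reverse only the characters at the marked indices of `s`."""
    chars = list(s)
    positions = sorted(set(i for i in marked if 0 <= i < len(chars)))
    picked = [chars[i] for i in positions]
    # Write the marked characters back in reverse order; the unmarked
    # characters stay where they are.
    for i, c in zip(positions, reversed(picked)):
        chars[i] = c
    return "".join(chars)
```

Marking every second character of `abcdef` (indices 0, 2 and 4) gives `ebcdaf`.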

Sort, O and N

The list of matches is sorted by their corresponding substitutions (remember that the substitutions will be equal to the matches unless the $ option is used). If O is used, the substitutions are compared as strings. If N is used, each substitution is mapped to an integer by finding the first match of the regex -?\d+ inside the substitution (if no match is found, the substitution is mapped to 0), and the substitutions are compared by their numerical value. The sorting algorithm is stable. The sorted list of matches is then reinserted into the separators and the result is returned.

Note that this stage expands overlapping matches.

Sort stages have the following default regex, which matches the individual lines of the input:

(?m:^.*$)

The following options are available:

  • ? (random): Instead of sorting the matches, they are shuffled (uniformly). In this case, there is no point in using the $ or ^ option.
  • ^ (reverse): After sorting the matches, the list of matches is reversed.
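
The numeric comparison used by N can be sketched in Python as follows. This is a model of the rule described above, not the interpreter's code; `numeric_key` and `sort_numeric` are hypothetical names.

```python
import re

def numeric_key(s):
    """Map a string to the first integer it contains, or 0 if there is none."""
    m = re.search(r"-?\d+", s)
    return int(m.group()) if m else 0

def sort_numeric(items, reverse=False):
    # sorted() is stable, so items with equal keys keep their input order.
    result = sorted(items, key=numeric_key)
    # The ^ option reverses the sorted list as a whole, which also flips
    # the relative order of items with equal keys.
    return result[::-1] if reverse else result
```

For instance, `sort_numeric(["b10", "a2", "x", "c-3"])` yields `["c-3", "x", "a2", "b10"]`, because `x` contains no number and is treated as 0.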

Split, S

This stage is generally used to split the input around the matches of a regex (i.e. to get the list of separators). However, like .NET's Split method, capturing groups from the matches will by default also be included in the result.

Each match is mapped to a list of captures made by successful groups within that match (if a group resulted in multiple captures, only the last capture is used). These captures are then inserted between the separators and the result is returned as a list (of separators and captures).
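
A Python sketch of this default behaviour is shown below. It only models the description above (no limits or options), and `split_with_groups` is an illustrative name. Python's `m.groups()` conveniently reports the last capture of each group and `None` for groups that did not take part in the match.

```python
import re

def split_with_groups(text, pattern):
    """Split `text` around matches, inserting captures of successful groups."""
    result, last = [], 0
    for m in re.finditer(pattern, text):
        result.append(text[last:m.start()])  # separator before this match
        # Only groups that took part in the match contribute a capture.
        result.extend(g for g in m.groups() if g is not None)
        last = m.end()
    result.append(text[last:])  # trailing separator
    return result
```

Splitting `a1b22c` around `(\d)(\d)?` gives `a`, `1`, `b`, `2`, `2`, `c`: the second group fails in the first match and therefore contributes nothing there.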

The following options are available:

  • Limit 2: The second limit filters the elements of the final list.
  • Limit 3: The third limit filters which groups contribute their captures to the final list. The numbering of this limit is a bit awkward, because limits are 0-based but group numbers are 1-based. So a limit of 2 will actually select the group with the number 3.
  • ? (random): A single element from the final list will be picked (uniformly) at random and returned.
  • ^ (reverse): The final list is reversed.
  • !- (omit groups): Groups are not inserted into the list and only the separators are returned.
  • !_ (omit empty): Empty separators are dropped from the final list (empty groups will be kept though).

Note that there is currently no point in using the $ option with this stage.

Transliterate, T and Y

This stage is used to map individual characters to other characters. Its syntax is slightly different from other stages. Its first source (after processing the configuration) doesn't just contain its primary regex, but actually up to three parts. Including the configuration, all of the following are valid Transliteration stages:

config`from
config`from`to
config`from`to`regex

If to is not given or empty, it defaults to _. If the regex is not given or empty it defaults to the following regex which matches the entire input:

\A(?s:.*)\z

First, both from and to are expanded to lists of characters, using similar rules to character classes in regex:

  • \ escapes the next character. The following escape sequences are known (only for convenience; you can also embed most of the characters directly in the source code): \a (bell, 0x07), \b (backspace, 0x08), \f (form feed, 0x0C), \n (linefeed, 0x0A; if you want to embed this literally, use ¶), \r (carriage return, 0x0D), \t (tab, 0x09), \v (vertical tab, 0x0B), \¶ (pilcrow, 0xB6). If a \ is followed by any other character, it just escapes that next character (this lets you add, e.g., ` or \ itself to the list). If there is a single \ left at the end of the stage, it represents a literal backslash.
  • If - is preceded and followed by a single character or escape sequence it denotes a range. As opposed to regex, ranges can be both ascending and descending. 0-4 denotes 01234, but d-a denotes dcba. Ranges can also be degenerate, e.g. a-a (representing just a). Backticks have to be escaped. A hyphen at the beginning or end of the current segment, or one which follows immediately after another range is treated literally. The character classes listed below are ignored when they appear as one end of a range.
  • There are several built-in character classes. As opposed to regex, they do not need a backslash, so be careful when using literal letters in your code:
    • d is equivalent to 0-9.
    • E is equivalent to 02468.
    • O is equivalent to 13579.
    • H is equivalent to 0-9A-F.
    • h is equivalent to 0-9a-f.
    • L is equivalent to A-Z.
    • l is equivalent to a-z.
    • V is equivalent to AEIOU.
    • v is equivalent to aeiou.
    • w is equivalent to _0-9A-Za-z.
    • p is equivalent to <sp>-~, where <sp> is a space.
    • The first o inserts the other set. That is, if o is used in from it inserts to. If o is used in to it inserts from. Subsequent occurrences of o are treated as literals. If o is used in both from and to it is treated as a literal everywhere.
  • Preceding any range or built-in character class with R reverses that range (this also works for o). Multiple Rs can be used as well, although that is generally useless, unless the code was created dynamically for an eval stage (an even number of R is a no-op, an odd number reverses the range). While R does work with ranges, it's simpler to just use the opposite range: Ra-z is the same as z-a. R not followed by a range or class is treated as a literal.
  • _ is a special blank. It doesn't represent any character when used in the from set. And it can be used in the to set to delete characters.
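
The range rule from the list above (ascending, descending and degenerate ranges) can be sketched in Python like this. The helper name `expand_range` is made up, and the sketch only covers a single range between two already-unescaped characters, not the full parsing of from and to.

```python
def expand_range(start, end):
    """Expand a range such as 0-4 or d-a into its list of characters."""
    a, b = ord(start), ord(end)
    step = 1 if a <= b else -1  # descending ranges simply count down
    return "".join(chr(c) for c in range(a, b + step, step))
```

Here `expand_range("0", "4")` gives `01234`, `expand_range("d", "a")` gives `dcba`, and reversing an ascending range (as R does) is the same as writing the opposite range.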

After generating these two lists, the stage determines how often to transliterate each character, by counting how many matches cover this character. This will usually be 0 or 1, but if overlapping matches are used, a character may be covered by several matches and be transliterated multiple times.

The stage then goes through the input and transliterates each character as often as required. How this transliteration works depends on whether a T (regular) or Y (cyclic) transliteration stage is being used:

  • T (regular): If the current character appears in the from list, it gets mapped to the character at the same position in the to list. If the to list contains a blank (_) in this position, the character is deleted instead (and will not be transliterated any further). If the character appears multiple times in from, only the first occurrence is used. If the to list is shorter than the from list and the current character appears after the end of the to list, it gets mapped to the last character in the to list (alternatively, you can think of it as padding the to list to the length of from using its last character).

  • Y (cyclic): This mode is a bit more complicated. First, you should think of from and to as infinite lists of characters, generated by repeating the finite lists indefinitely. As an example, consider the following stage:

    Y`abc`12
    

    You should think of the from list as abcabcabca... and the to list as 1212121212.... These are now combined into a single (infinite) list of mappings by mapping characters at corresponding positions in from to those in to. So we would get the infinite list of mappings:

    a -> 1
    b -> 2
    c -> 1
    a -> 2
    b -> 1
    c -> 2
    a -> 1
     ...
    

    Now, to transliterate a character, we find the first mapping with that character on the left-hand side and map it to the corresponding character on the right-hand side. Then we delete this mapping from the infinite list. Therefore, the next occurrence of this character might be transliterated differently. This particular example would transliterate aaaa to 1212, bbbb to 2121 and cccc to 1212. Again, if a character is mapped to a blank (_) it gets deleted from the input instead and will not be transliterated any further.

The stage returns the final result of this process.
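
The two modes can be compared with the following Python sketch. It is a simplified model rather than Retina's implementation: from and to are assumed to be already expanded into plain strings, `_` in to stands for the blank, every character is transliterated exactly once, and the function names are hypothetical.

```python
def translit_regular(text, frm, to):
    """T-type: map each character via its first position in `frm`."""
    to = to or "_"  # an empty `to` defaults to a blank
    out = []
    for ch in text:
        i = frm.find(ch)  # the first occurrence in `frm` wins
        if i < 0:
            out.append(ch)  # not listed in `frm`: left unchanged
            continue
        # A short `to` behaves as if padded with its last character.
        target = to[min(i, len(to) - 1)]
        if target != "_":  # a blank deletes the character
            out.append(target)
    return "".join(out)

def translit_cyclic(text, frm, to):
    """Y-type: consume mappings from the infinitely repeated lists."""
    to = to or "_"
    used = {}  # number of mappings already consumed per character
    out = []
    for ch in text:
        occurrences = [i for i, c in enumerate(frm) if c == ch]
        if not occurrences:
            out.append(ch)
            continue
        k = used.get(ch, 0)
        used[ch] = k + 1
        # Index of the k-th occurrence of `ch` in the repeated `frm` list.
        index = (k // len(occurrences)) * len(frm) + occurrences[k % len(occurrences)]
        target = to[index % len(to)]
        if target != "_":
            out.append(target)
    return "".join(out)
```

With from `abc` and to `12`, `translit_cyclic` reproduces the example above: `aaaa` becomes `1212`, `bbbb` becomes `2121` and `cccc` becomes `1212`.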

The following options are available:

  • Limit 2: The second limit selects which characters in each match to transliterate, or in other words which characters in each match will have their transliteration counter increased.
  • ^ (reverse): After determining how often to transliterate each character, iterate through the characters in from right to left. This can only affect the result for cyclic (Y-type) transliterations.
  • !| (transliterate once): Each character will be transliterated at most once, even if it's covered by multiple matches.

Note that there is currently no point in using the $ option with this stage.

Compound stages

Finally, this section lists the exact semantics of each compound stage. Remember that (except for groups), each compound stage has exactly one child stage, and does not share the common preprocess with atomic stages (in general, compound stages don't even have a regex associated with them).

Group, ()

Groups are somewhat special in that they contain an arbitrary number of child stages (although usually at least two), so we consider them first.

By default, a group stage simply executes all its child stages in order (each one modifying the working string). They are mostly used to feed multiple stages as a single block to other compound stages.

The following options are available to modify this behaviour:

  • Regex option: If a regex option is given, the group becomes an if/else-type construct. If the regex matches the stage's input, only the group's first child stage is executed. Otherwise, all stages except the first are executed in order.
  • String option: Instead of a regex, one can also specify a string, in which case the first child stage is executed if the input contains this string (and otherwise the remaining children are executed). If a regex option is also given, the regex takes precedence.
  • ^ (reverse): When used in conjunction with a regex or string option, the logic of the condition is negated. That is, the first child stage is executed if the regex doesn't match the input or the input doesn't contain the string (and the remaining children otherwise).
  • ? (random): Only one of the child stages is picked (uniformly) at random and executed. If a regex or string option is also given, the random option takes precedence.
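
The if/else behaviour of a group with a regex option can be modelled as below. This is a Python sketch under simplifying assumptions: child stages are represented as plain functions from string to string, only the regex and ^ options are covered, and `run_group` is an invented name.

```python
import re

def run_group(children, text, regex=None, negate=False):
    """Run a group's children, optionally as an if/else on `regex`."""
    if regex is None:
        selected = children  # default: run all children in order
    else:
        condition = re.search(regex, text) is not None
        if negate:  # models the ^ option
            condition = not condition
        # "if" branch: only the first child; "else" branch: all the others.
        selected = children[:1] if condition else children[1:]
    for child in selected:
        text = child(text)
    return text
```

For example, with the children `[str.upper, lambda s: s + "!"]` and the regex `a`, the input `ab` becomes `AB`, whereas the input `xy` becomes `xy!`.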

Conditional, &

Without additional configuration, this stage does nothing, but it is generally used with a regex option. If a regex option is given, the child stage is only executed if the regex matches the input. Otherwise, the input is returned unchanged.

The following options are available:

  • Regex option: Specifies the regex which determines whether the child stage is executed. Usually required.
  • String option: Instead of a regex, one can also specify a string, in which case the child stage is executed if the input contains this string. If a regex option is also given, the regex takes precedence.
  • ? (random): Instead of either a regex or string option, the random option can also be used, in which case the child stage is executed with a probability of 50%.
  • ^ (reverse): When used in conjunction with a regex or string option, the logic of the condition is negated. That is, the child stage is executed if the regex doesn't match the input or the input doesn't contain the string.

Dry run, *

A dry run is used to run a child stage without affecting the working string. This can be useful for printing an intermediate result, or storing it in the history. Specifically, this stage executes its child stage, but then returns the input (instead of the child's result).

Dry runs do not register their results with the history, unless they have a regex or string option.

The following options are available:

  • Regex option: If a regex option is given, the stage tries to match the child's result with the regex. If the regex matches, the child's result is returned, otherwise the input is returned.
  • String option: Instead of a regex, one can also specify a string, in which case the child's result is returned if it contains this string. If a regex option is also given, the regex takes precedence.
  • ^ (reverse): When used in conjunction with a regex or string option, the logic of the condition is negated. That is, the child's result is returned only if the regex doesn't match it or it doesn't contain the string.

Eval, ~

Eval stages let you execute an arbitrary string as Retina code.

The eval stage first executes its child stage, which should generate the program's source code. The stage will then start up a new instance of the Retina interpreter (not actually a separate process) and feed its child stage's result to this new instance as the source code. The eval stage's input is also used as the inner program's input.

The inner program has its own history (so it can neither access history data of the main program, nor can it affect the main program's history), and its standard output stream is captured into a string. Once the inner program terminates, the captured stream is returned as the eval stage's output. Note that it's impossible to print directly to the main program's standard output stream from within the inner program.

The following options are available:

  • String option: Instead of feeding the eval stage's input directly to the inner program, this input can be determined with a substitution. This is done by treating the string option as a substitution pattern for the regex \A(?s:.*)\z (which matches the entire input). The substitution will often be based on a history element $+n or $-n.
  • ^ (reverse): This reverses the roles of the eval stage's input and its child stage's result. In other words, the stage's input is used as the inner program's source code, and the child stage's result is used as the inner program's input. If used in conjunction with a string option, the inner program's source code is determined by the substitution.

Loop, +

This stage is used to run its child multiple times in a loop. The type of loop depends on the stage's configuration. By default, a convergence loop is used. That means the child is repeatedly executed, until its result is equal to its input (so this iteration of the child stage didn't affect the working string any more). The child will always be executed at least once.

The following options are available to implement different kinds of loops:

  • ? (random): Before each iteration, there's a 50% chance that the loop ends. Otherwise, the child stage is executed again. So there's a 50% chance that there will be no iteration, a 25% chance that there will be one iteration, a 12.5% chance that there will be two iterations and so on. This loop will continue even if the working string stops changing (this is relevant for side effects like output and the history).
  • Regex option: If a regex option is given, the stage becomes a while-loop. As long as the regex matches the working string, the child stage is executed again.
  • ^ (reverse): When used in conjunction with a regex option, the stage becomes an until-loop instead: as long as the regex doesn't match the working string, the child stage is executed again.
  • Integer option: This is actually notational abuse of a limit. Only the limit's upper end is used, so there's no point in using anything but an exact limit here, which is just a single integer n. If n is positive, the loop remains a convergence loop, but it will terminate after at most n iterations (even if convergence was not reached). If n is negative, the loop will run for exactly |n| iterations, even if the working string stops changing (this is relevant for side effects like output and the history).
  • String option: Instead of an integer option, a string option can be used to choose the iteration limit dynamically. This is done by treating the string as a substitution pattern for the regex \A(?s:.*)\z (which matches the entire input). The substitution will often be based on the input's length $.& or a history element $+n or $-n. The first match of the regex -?\d+ in the substitution result will be used as the iteration count. The sign is treated the same way as for an integer option.
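
The default convergence loop and the integer option can be sketched in Python as follows. This is an illustrative model only: the child stage is a plain function, `run_loop` is a hypothetical name, and the behaviour for n = 0 is not specified above (the sketch simply runs no iterations in that case).

```python
def run_loop(child, text, n=None):
    """n=None: run until convergence; n>0: at most n iterations;
    n<0: exactly |n| iterations."""
    if n is not None and n < 0:
        # Fixed iteration count, regardless of whether the string changes.
        for _ in range(-n):
            text = child(text)
        return text
    iterations = 0
    while n is None or iterations < n:
        result = child(text)
        iterations += 1
        if result == text:  # converged: this iteration changed nothing
            break
        text = result
    return text
```

With a child that replaces `aa` by `a`, the input `aaaa` converges to `a`, while the same child with n = 1 stops at `aa`.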

Match mask, _

This stage can be used to apply a child stage only to the matches of a given regex.

Specifically, the stage finds all matches of its regex in the input, and then invokes its child stage once for each match, passing only that match (without any context) in as the input. The matches are then replaced by the corresponding results of the child stage.

The stage's regex defaults to (?m:^.*$), which matches each line of the input, but it will often be replaced by an explicit regex option. Note that without further options match mask stages and per-line stages are functionally identical.

The following options are available:

  • Regex option: If a regex is given, this regex is used instead of the default one to mask the input.
  • String option: If a string is given instead of a regex, it will be converted to a regex by escaping it (so it's just a regex that matches the string literally).
  • Limit 1: The first limit is used to filter the list of matches, similar to its behaviour for atomic stages.
  • @ (single random match): After applying the limit, a single random match is picked for processing (all others remain unchanged).
  • ^ (reverse): The matches are passed to the child stage from last to first. This can be relevant for side effects like output and the history.
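
In Python terms, a match mask without limits or further options behaves much like `re.sub` with a callback. The following is a sketch of that analogy, not the interpreter's code; `match_mask` is an invented name and the child stage is modelled as a plain function.

```python
import re

def match_mask(child, text, regex=r"(?m:^.*$)"):
    """Apply `child` to every match of `regex`, leaving the rest untouched."""
    # Each match is passed to the child on its own, without any context,
    # and is replaced by whatever the child returns.
    return re.sub(regex, lambda m: child(m.group()), text)
```

For instance, `match_mask(str.upper, "ab 12 cd", r"[a-z]+")` upper-cases the two words and leaves the digits alone.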

Output, >, <, \, ;

Output stages can be introduced with four different characters, but they all represent the same stage type, just with different internal flags set. Output stages do not register their results with the history. This is because output stages are used purely for their side effect and will never affect the working string themselves.

Here are the differences between the four different types of output stages:

  • >: This is the standard output stage. It executes its child stage, prints its result to the standard output stream, and returns that result.
  • <: This is a pre-print stage. It prints its input to the standard output stream, then executes its child stage, and returns the result.
  • ;: This is similar to >, but it prints the result of the child stage only if it differs from the input (i.e. only if the child stage changed the working string). This is useful for printing the iterations of convergence loops, where the last iteration will leave the string unchanged (and one often does not want to print this final result twice).
  • \: This is shorthand for a >-type output stage which already has a string option containing a single linefeed (0x0A).

The way these different types of output stages actually do the printing can be modified with the following options:

  • Limit 1: The first limit is used to filter the characters of the output.
  • String option: If a string option is given, this string is printed after the stage's regular output. This option is implicitly given as ¶ for \-type output stages. If a limit is also given, it does not affect this suffix (only the characters of the main output are filtered).
  • ? (random): The stage prints its output only with a 50% probability.

Per-line, %

This stage can be used to apply a child stage to the individual lines of its input.

Specifically, the stage splits the input into lines using its line separator, and then invokes its child stage once for each line, passing only that line (without any context) in as the input. The lines are then replaced by the corresponding results of the child stage (and returned with the line separators reinserted).

The stage's line separator defaults to ¶ (a single linefeed character), but it can be replaced by an explicit regex or string option. Note that without further options match mask stages and per-line stages are functionally identical.

The following options are available:

  • Regex option: If a regex is given, this regex is used instead of the default one to split the input into lines.
  • String option: If a string is given instead of a regex, this string will be used as the line separator.
  • @ (single random match): A single line separator is picked (uniformly) at random and the input is split only into two lines around this separator.
  • Limit 1: The first limit is used to filter the list of lines. Only the lines covered by the limit will be processed by the child stage.
  • ? (random): After applying the limit, a single line is picked (uniformly) at random, and only this line will be processed by the child stage.
  • ^ (reverse): The lines are passed to the child stage from last to first. This can be relevant for side effects like output and the history.
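
The basic per-line behaviour (without limits or random options) corresponds to the following Python sketch. It is a simplified model: the child stage is a plain function, only a literal string separator is supported, and `per_line` is an invented name.

```python
def per_line(child, text, separator="\n"):
    """Apply `child` to every line and reinsert the separators."""
    return separator.join(child(line) for line in text.split(separator))
```

For example, `per_line(lambda s: s[::-1], "ab\ncd")` reverses each line separately and returns `ba` and `dc` joined by a linefeed.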