Regex Manual

Paul Manias edited this page May 2, 2026 · 5 revisions

This manual provides comprehensive technical documentation for regular expression support in Kōtuku. It covers pattern syntax, behaviour, and all supported features for both Tiri and C++ developers.

For information on how to use the Regex class and its methods in your code, please refer to the Regex module API documentation.


Introduction

Kōtuku's regex support is based on the regular expression syntax defined in the ECMAScript Specification. The implementation provides full Unicode support with UTF-8 encoding enabled by default.

Key features include:

  • ECMAScript Compliance: Supports expressions defined in the latest ECMAScript specification draft
  • Unicode Support: Full UTF-8 Unicode support with property matching
  • Set Operations: Character class intersection, subtraction, and string sequences
  • Named Captures: Named and numbered capture groups with backreferences
  • Lookahead and Lookbehind: Zero-width assertions for complex matching
  • Flag Modifiers: Inline flag control for case sensitivity, multiline, and dotall modes

Regular expressions are compiled into pattern objects that can be reused for efficient matching operations.


Character Matching

Regular expressions match characters in the target text based on pattern specifications. The following table describes all character matching forms:

Pattern Description Example
. Matches any character except line terminators (U+000A, U+000D, U+2028, U+2029). With the dotall flag, matches every code point. a.c matches "abc", "aXc"
\0 Matches NULL character (U+0000) \0 matches null byte
\t Matches Horizontal Tab (U+0009) a\tb matches "a b"
\n Matches Line Feed (U+000A) a\nb matches "a\nb"
\v Matches Vertical Tab (U+000B) \v matches vertical tab
\f Matches Form Feed (U+000C) \f matches form feed
\r Matches Carriage Return (U+000D) \r\n matches Windows line ending
\cX Matches control character where X is A-Z or a-z. Value is (code point of X) & 0x1F \cA matches Ctrl-A (U+0001)
\\ Matches backslash character (U+005C) \\ matches "\"
\xHH Matches character with hexadecimal code HH (00-FF) \x41 matches "A"
\uHHHH Matches character with Unicode code point HHHH \u0041 matches "A"
\u{H...} Matches character with Unicode code point represented by hex digits (up to 10FFFF) \u{1F600} matches 😀
\X When X is one of ^ $ . * + ? ( ) [ ] { } | /, matches X literally \( matches "("
Character Any character not listed above matches itself abc matches "abc"

Line Terminators

Line terminator code points are: U+000A (Line Feed), U+000D (Carriage Return), U+2028 (Line Separator), and U+2029 (Paragraph Separator).

Escape Sequences

All escape sequences must be complete and valid. If \c is not followed by a letter A-Z or a-z, \x is not followed by two hexadecimal digits, \u is not followed by four hexadecimal digits, or \u{...} does not contain valid hexadecimal or exceeds U+10FFFF, an error_escape exception is thrown.

Special Character Escaping

In character classes (see Character Classes), the hyphen - can also be escaped as \-. The character ] must always be escaped as \] to be matched literally.


Alternatives

The | operator matches one of multiple alternative patterns, evaluated from left to right:

A|B|C

This matches pattern A, or pattern B, or pattern C. The first successful match is adopted, and remaining alternatives are not evaluated.

Example

pattern = regex.new('abc|abcdef')
match = pattern.match('abcdef')
-- match[1] = "abc" (not "abcdef")

Even though "abcdef" would match the second alternative completely, the pattern matches "abc" from the first alternative because alternatives are evaluated left to right.

Multiple alternatives can be combined:

pattern = regex.new('cat|dog|bird|fish')
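
The left-to-right rule can be checked against any ECMAScript engine. The sketch below uses JavaScript's built-in RegExp from TypeScript as a stand-in for Kōtuku's engine; that substitution is an assumption, and `firstMatch` is a hypothetical helper rather than part of the Regex module API.

```typescript
// Hypothetical helper: returns the full text of the first match, or null.
// JavaScript stores the full match at index 0, whereas the Tiri examples
// in this manual show it at index 1.
function firstMatch(source: string, text: string): string | null {
  const m = new RegExp(source).exec(text);
  return m === null ? null : m[0];
}

// The first alternative that succeeds wins, even if a later one is longer.
const shortFirst = firstMatch("abc|abcdef", "abcdef"); // "abc"

// Listing the longer alternative first lets it be tried first.
const longFirst = firstMatch("abcdef|abc", "abcdef"); // "abcdef"
```

When one alternative is a prefix of another, ordering the longer one first is usually the simplest way to get the longer match.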

Character Classes

Character classes define sets of characters that can match at a single position in the target text.

Basic Character Classes

A character class is enclosed in square brackets [...] and matches any single character from the set:

Pattern Description Example
[ABC] Matches any of A, B, or C [ABC] matches "A", "B", or "C"
[^DEF] Matches any character except D, E, or F (negated class) [^DEF] matches any character but "D", "E", "F"
[G^H] Matches G, ^, or H (^ not first, so literal) [G^H] matches "G", "^", or "H"
[I-K] Matches any character from I to K inclusive (range) [I-K] matches "I", "J", "K"
[-LM] Matches -, L, or M (leading hyphen is literal) [-LM] matches "-", "L", "M"
[N-P-R] Matches N, O, P, -, or R (trailing hyphen after range is literal) [N-P-R] matches "N", "O", "P", "-", "R"
[S\-U] Matches S, -, or U (escaped hyphen) [S\-U] matches "S", "-", "U"
[.*+?] Metacharacters such as ., *, + and ? lose their special meaning in character classes [.*+?] matches ".", "*", "+", "?"
[] Empty class matches no code points (always fails) [] never matches
[^] Complement of empty class matches any code point [^] matches any character including line terminators

Character Class Rules

  1. Negation: When ^ is the first character in [], the class is negated and matches any character NOT in the set.
  2. Closing Bracket: The ] character must always be escaped as \] to be included literally in a character class.
  3. Hyphen: The - character creates a range when between two characters. To match - literally, place it first, last, or escape it as \-.
  4. Special Characters: Most special regex characters (., *, +, etc.) lose their special meaning inside character classes.

Character Ranges

Ranges define a span of consecutive Unicode code points:

pattern = regex.new('[A-Z]')     -- Matches any uppercase letter A-Z
pattern = regex.new('[0-9]')     -- Matches any digit
pattern = regex.new('[a-zA-Z]')  -- Matches any letter

If the range is invalid (e.g., [b-a] where the starting code point is greater than the ending code point), an error_range exception is thrown.

Case-Insensitive Matching

When case-insensitive matching is enabled (with the icase flag), character classes expand to include case-folded variations:

pattern = regex.new('[E-F]', regex.ICASE)
-- Matches 'E', 'F', 'e', 'f', and any Unicode case variants

Note: Range [E-f] with icase flag will match all characters from U+0045 ('E') to U+0066 ('f'), including brackets, backslash, and other punctuation, plus their case-folded variants.

Predefined Character Classes

Predefined character classes provide convenient shortcuts for common character sets:

Pattern  Description                                                     Equivalent
\d       Matches any decimal digit                                       [0-9]
\D       Matches any non-digit                                           [^0-9]
\s       Matches any whitespace character (WhiteSpace + LineTerminator)  [ \t\n\v\f\r\u00a0\u1680\u2000-\u200a\u2028-\u2029\u202f\u205f\u3000\ufeff]
\S       Matches any non-whitespace                                      [^ \t\n\v\f\r\u00a0\u1680\u2000-\u200a\u2028-\u2029\u202f\u205f\u3000\ufeff]
\w       Matches any word character (alphanumeric + underscore)          [0-9A-Za-z_]
\W       Matches any non-word character                                  [^0-9A-Za-z_]
\p{...}  Matches characters with specified Unicode property              (See Unicode Support)
\P{...}  Matches characters without specified Unicode property           (See Unicode Support)

All predefined character classes can be used inside character classes:

pattern = regex.new('[\\d!\"#$%&\'\\(\\)]')  -- Matches digits or punctuation (parentheses escaped, see Character Escaping in Character Classes)

Note: The \s whitespace class automatically expands when new code points are added to Unicode category Zs.

Set Operations

Character classes support advanced set operations for precise character matching. These operations are always available as standard features.

Intersection with &&

The intersection operator && matches characters that belong to both sets:

[A&&B]

Examples:

-- Match lowercase Latin letters only
pattern = regex.new('[\\p{sc=Latin}&&\\p{Ll}]')
-- Matches: a, b, c, ..., z, ñ, ø, etc. (lowercase Latin)
-- Does NOT match: A, B, C, ... (not lowercase)

-- Match ASCII letters only (not extended Latin)
pattern = regex.new('[\\p{sc=Latin}&&[A-Za-z]]')

Subtraction with --

The subtraction operator -- matches characters in the first set but not in the second:

[A--B]

Examples:

-- Match Latin letters that are NOT lowercase
pattern = regex.new('[\\p{sc=Latin}--\\p{Ll}]')
-- Matches: A, B, C, ..., Z (uppercase, titlecase, etc.)
-- Does NOT match: a, b, c, ... (lowercase excluded)

-- Match letters except vowels
pattern = regex.new('[A-Za-z--[AEIOUaeiou]]')
-- Matches: consonants only

String Sequences with \q{...}

The \q{...} syntax allows character classes to match multi-character sequences:

[a-z\q{ch|th|ph}]

This matches either:

  • Any single character from a-z, OR
  • The sequence "ch", OR
  • The sequence "th", OR
  • The sequence "ph"

Longest Match Priority: When strings are included in a character class, the longest matching string is always selected first:

pattern = regex.new('[a-z\\q{ch|chocolate}]')
-- When matching "chocolate", matches the full word "chocolate"
-- Not "ch" followed by "ocolate"

The sequence [a-z\q{ch|th|ph}] is functionally equivalent to (?:ch|th|ph|[a-z]).

Examples:

-- Match common digraphs or single letters
pattern = regex.new('[a-z\\q{ch|sh|th|ph}]+')

-- Match emoticon sequences or letters (-, ( and ) are escaped inside the class)
pattern = regex.new('[A-Z\\q{:\\-\\)|:\\-\\(|:\\-D}]')

String sequences can be used with all set operators (union, intersection, subtraction).

Nesting Character Classes

Character classes can be nested as operands for set operations:

-- Valid: nested classes with operators
pattern = regex.new('[\\p{sc=Latin}--[a-z]]')

-- Valid: nested union and subtraction
pattern = regex.new('[A[B--C]D]')

Operator Restriction: Only one type of operator can be used per level of nesting:

-- INVALID: mixing union and subtraction at the same level
[AB--CD]           -- Error: union (AB) then subtraction (--)

-- VALID: operators in different nesting levels
[[AB]--[CD]]       -- OK: separate nesting levels
[A[B--C]D]         -- OK: subtraction inside union

Multiple uses of the same operator are permitted:

-- Valid: multiple subtractions at same level
[\p{sc=Latin}--\p{Lu}--[a-z]]

Character Escaping in Character Classes

The following characters must be escaped with \ when used literally in character classes:

  • (, ), [, {, }, /, -, |
  • ] must always be escaped (even outside character classes)

-- Correct
pattern = regex.new('[\\(\\)\\[\\]\\{\\}]')

-- Incorrect (throws error_noescape)
pattern = regex.new('[(]')

Reserved Double Punctuators

The following 18 double-character sequences are reserved for future use and cannot appear in character classes:

!!  ##  $$  %%  **  ++  ,,  ..
::  ;;  <<  ==  >>  ??  @@  ^^
``  ~~

If any of these appear in a character class, an error_operator exception is thrown.


Quantifiers

Quantifiers specify how many times a pattern element must match. Each quantifier has a greedy and non-greedy form.

Quantifier  Non-Greedy  Matches     Description
*           *?          0 or more   Repeats the preceding element zero or more times
+           +?          1 or more   Repeats the preceding element one or more times
?           ??          0 or 1      Makes the preceding element optional
{n}         N/A         Exactly n   Repeats the preceding element exactly n times
{n,}        {n,}?       n or more   Repeats the preceding element at least n times
{n,m}       {n,m}?      n to m      Repeats the preceding element between n and m times (inclusive)

Greedy vs Non-Greedy

Greedy quantifiers (default) match as many characters as possible while still allowing the overall pattern to succeed:

pattern = regex.new('a.*b')
match = pattern.match('axxxbxxxb')
-- match[1] = "axxxbxxxb" (matches up to the last 'b')

Non-greedy quantifiers (with ? suffix) match as few characters as possible while still allowing the overall pattern to succeed:

pattern = regex.new('a.*?b')
match = pattern.match('axxxbxxxb')
-- match[1] = "axxxb" (stops at the first 'b')
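
The two behaviours can be compared side by side in any ECMAScript engine. This is a minimal sketch that assumes JavaScript's RegExp behaves like Kōtuku's engine for these patterns; `fullMatch` is a hypothetical helper, not part of the Regex module.

```typescript
// Hypothetical stand-in for pattern.match(): returns the full match text.
function fullMatch(source: string, text: string): string | null {
  const m = new RegExp(source).exec(text);
  return m === null ? null : m[0];
}

const subject = "axxxbxxxb";
const greedy = fullMatch("a.*b", subject); // "axxxbxxxb"
const lazy = fullMatch("a.*?b", subject);  // "axxxb"

// "As few as possible" still has to let the rest of the pattern succeed:
// the trailing $ forces the lazy quantifier to extend to the last "b".
const lazyAnchored = fullMatch("a.*?b$", subject); // "axxxbxxxb"
```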

Quantifier Rules

  1. Quantifiers must have a preceding expression to quantify. Using a quantifier without a preceding element (e.g., * at the start of a pattern) throws error_badrepeat.

  2. If a quantifier range is invalid (e.g., {3,2} where n > m), an error_badbrace exception is thrown.

  3. Mismatched { or } characters throw error_brace.

Examples

-- Match one or more digits
pattern = regex.new('\\d+')

-- Match optional sign followed by digits
pattern = regex.new('[+-]?\\d+')

-- Match exactly 3 letters
pattern = regex.new('[A-Za-z]{3}')

-- Match 2 to 4 word characters (greedy)
pattern = regex.new('\\w{2,4}')

-- Match 2 to 4 word characters (non-greedy)
pattern = regex.new('\\w{2,4}?')

-- Match at least 5 digits
pattern = regex.new('\\d{5,}')

Quantifiers and Captured Groups

When a capturing group is quantified, the captured value is updated on each iteration. Only the last iteration's match is preserved:

pattern = regex.new('(?:(a)|(b))+')
match = pattern.match('ab')
-- match[1] = "ab" (full match)
-- match[2] = "" (empty, last iteration captured 'b', not 'a')
-- match[3] = "b" (last iteration captured 'b')
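
The same pattern can be run in JavaScript's RegExp to observe the reset on each iteration. This sketch assumes equivalent ECMAScript semantics; note that JavaScript indexes the full match at 0 and reports a group that did not participate as undefined, where the Tiri example above shows an empty value. `lastIterationCaptures` is a hypothetical helper.

```typescript
// Returns [full match, group 1, group 2] for the pattern (?:(a)|(b))+.
function lastIterationCaptures(text: string): (string | undefined)[] {
  const m = new RegExp("(?:(a)|(b))+").exec(text);
  if (m === null) return [];
  return [m[0], m[1], m[2]];
}

// "ab": the final iteration matched "b", so group 1 is reset.
const forAb = lastIterationCaptures("ab"); // ["ab", undefined, "b"]

// "ba": the final iteration matched "a", so group 2 is reset.
const forBa = lastIterationCaptures("ba"); // ["ba", "a", undefined]
```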

Grouping and Backreferences

Parentheses create groups for capturing matches and controlling operator precedence.

Capturing Groups

Capturing groups are created with (...) and are numbered starting from 1:

pattern = regex.new('(\\d{3})-(\\d{3})-(\\d{4})')
match = pattern.match('555-123-4567')
-- match[1] = "555-123-4567" (full match, always at index 1)
-- match[2] = "555" (first capturing group)
-- match[3] = "123" (second capturing group)
-- match[4] = "4567" (third capturing group)

Group Numbering: Groups are numbered by the position of their opening ( parenthesis from left to right:

pattern = regex.new('((a)(b))c')
-- Group 1: ((a)(b))
-- Group 2: (a)
-- Group 3: (b)

Non-Capturing Groups

Non-capturing groups (?:...) group expressions without creating a capture:

pattern = regex.new('(?:tak(?:e|ing))')
-- Matches "take" or "taking" without capturing

Use non-capturing groups to:

  • Apply quantifiers to multiple characters: (?:ab)+
  • Group alternatives: (?:cat|dog)
  • Improve performance (slightly faster than capturing groups)

Named Capture Groups

Named groups associate a name with a captured substring:

(?<name>...)

Example:

pattern = regex.new('(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})')
match = pattern.match('2025-10-14')
-- match[1] = "2025-10-14"
-- match[2] = "2025" (group 1, also accessible as 'year')
-- match[3] = "10" (group 2, also accessible as 'month')
-- match[4] = "14" (group 3, also accessible as 'day')

Named groups are also assigned a number and can be accessed by both name and number.

Duplicate Named Groups

Named groups can be reused if they appear in different alternatives:

pattern = regex.new('(?<year>\\d{4})-\\d{1,2}|\\d{1,2}-(?<year>\\d{4})')
-- Matches "2025-10" or "10-2025"
-- 'year' captures the 4-digit year from either position

This feature was introduced in ES2025.

Numeric Backreferences

A backreference \N (where N is a positive integer starting from 1) matches the same text that was captured by group N:

pattern = regex.new('(TO|to)..\\1')
-- Matches "TOMATO" or "tomato" but not "Tomato"
-- \1 refers to captured text from group 1

Example:

pattern = regex.new('(["\']).*?\\1')
-- Matches string in quotes: "hello" or 'hello'
-- But not mixed quotes: "hello'

Named Backreferences

A backreference \k<name> matches the text captured by a named group:

pattern = regex.new('(?<quote>["\']).*?\\k<quote>')
-- Same as above, but using named group

Backreference Rules

  1. Forward References: Backreferences can appear before their corresponding group:

    pattern = regex.new('\\1(abc)')  -- Valid in ECMAScript
  2. Undefined Matches: A backreference to a group that hasn't captured anything matches the empty string:

    pattern = regex.new('(a)?b\\1')
    -- Matches "b" (group 1 didn't capture, so \1 matches empty string)
  3. Invalid Groups: If a backreference refers to a non-existent group number, an error_backref exception is thrown:

    pattern = regex.new('\\5')  -- Error: no group 5 exists
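
The first two rules and the quote-matching example can be exercised with JavaScript's RegExp, used here as an assumed stand-in for Kōtuku's engine. `findFirst` is a hypothetical helper; the third rule is not shown because JavaScript treats an out-of-range `\5` differently outside Unicode mode.

```typescript
// Hypothetical helper: full text of the first match, or null.
function findFirst(source: string, text: string): string | null {
  const m = new RegExp(source).exec(text);
  return m === null ? null : m[0];
}

// Matching quotes: \1 must repeat whichever quote group 1 captured.
const quoted = findFirst("([\"']).*?\\1", "say \"hello\" now"); // "\"hello\""
const mixed = findFirst("([\"']).*?\\1", "\"hello'");           // null

// Forward reference: \1 is still unset when reached, so it matches "".
const forward = findFirst("\\1(abc)", "abc"); // "abc"

// Unset group: at index 0 the group captures "a" and \1 then needs a second
// "a", which fails; the search moves on and matches just "b".
const unsetGroup = findFirst("(a)?b\\1", "ab");  // "b"
const setGroup = findFirst("(a)?b\\1", "aba");   // "aba"
```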

Group Capture Clearing

When a capturing group is inside a quantified expression, captures are cleared on each iteration:

pattern = regex.new('(?:(a)|(b))+')
match = pattern.match('ab')
-- Only the last iteration's captures are retained
-- match[2] = "" (group 1's last iteration matched nothing)
-- match[3] = "b" (group 2's last iteration matched "b")

Flag Modifiers

Flag modifiers allow inline control of matching behaviour within specific parts of a pattern.

Bounded Flag Modifiers

Bounded flag modifiers enable or disable flags only within a specific group:

(?ims-ims:...)

Available Flags:

Flag Meaning
i Case-insensitive matching (icase)
m Multiline mode (^ and $ match line boundaries)
s Dotall mode (. matches line terminators)
-i Disable case-insensitive matching
-m Disable multiline mode
-s Disable dotall mode

Examples:

-- Case-insensitive only for middle section
pattern = regex.new('hello(?i:world)THERE')
-- Matches: "helloworldTHERE", "helloWORLDTHERE", "helloWoRlDTHERE"
-- Does NOT match: "HELLOworldthere" (case-sensitive outside group)

-- Combine multiple flags
pattern = regex.new('(?ims:.*)')
-- Case-insensitive + multiline + dotall for entire group

-- Disable flags
pattern = regex.new('hello(?-i:world)', regex.ICASE)
-- "hello" is case-insensitive (ICASE flag), "world" is case-sensitive

Flag Modifier Rules

  1. Single Use per Flag: Each flag letter can only appear once per modifier group:

    -- INVALID: 'i' appears twice
    (?ii:...)      -- Throws error_modifier
    (?i-i:...)     -- Throws error_modifier
  2. Scope: Flag modifiers affect only the expressions inside their group.

  3. ES2025 Feature: Bounded flag modifiers were introduced in ES2025 and are enabled by default.


Assertions

Assertions test conditions at the current position without consuming characters (zero-width).

Anchors

Assertion Description
^ Matches at the start of the string. With multiline flag, also matches immediately after line terminators.
$ Matches at the end of the string. With multiline flag, also matches immediately before line terminators.

Examples:

-- Match lines starting with "#"
pattern = regex.new('^#.*', regex.MULTILINE)

-- Match lines ending with ";"
pattern = regex.new('.*;$', regex.MULTILINE)

Word Boundaries

Assertion Description
\b Matches at a word boundary: between a \w character and a \W character, or between a \w character and the start or end of the text
\B Matches at a non-word boundary: any position where \b does not match

Examples:

-- Match "cat" as a whole word
pattern = regex.new('\\bcat\\b')
-- Matches: "cat in hat"
-- Does NOT match: "concatenate"

-- Match "cat" not as a whole word
pattern = regex.new('\\Bcat\\B')
-- Matches: "concatenate"
-- Does NOT match: "cat in hat"

Note: Inside a character class [...], \b matches the backspace character (U+0008), not a word boundary. Using \B inside a character class throws error_escape.

Lookahead Assertions

Lookahead assertions check if a pattern matches ahead without consuming characters:

Assertion Description
(?=...) Positive lookahead: succeeds if pattern matches ahead
(?!...) Negative lookahead: succeeds if pattern does NOT match ahead

Examples:

-- Match "a" only if followed by "bc" or "def"
pattern = regex.new('a(?=bc|def)')
-- Matches: the "a" in "abc" and the "a" in "adef" (the lookahead text is not part of the match)
-- Does NOT match: "axyz"

-- Match "a" only if NOT followed by "bc" or "def"
pattern = regex.new('a(?!bc|def)')
-- Matches: the "a" in "axyz"
-- Does NOT match: "abc", "adef"

-- Find & symbols that are not HTML entities
pattern = regex.new('&(?!amp;|lt;|gt;|#)')
-- Matches bare "&" but not "&amp;", "&lt;", etc.
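
Because a lookahead consumes nothing, the pattern above can drive a replacement that touches only the bare ampersand. A minimal sketch using JavaScript's RegExp as an assumed stand-in; `escapeBareAmpersands` is a hypothetical name.

```typescript
// Replaces "&" with "&amp;" unless it already starts one of the entities
// listed in the lookahead (amp, lt, gt, or a numeric reference).
function escapeBareAmpersands(html: string): string {
  return html.replace(/&(?!amp;|lt;|gt;|#)/g, "&amp;");
}

const escaped = escapeBareAmpersands("a & b &amp; c &lt; d &#38; e");
// "a &amp; b &amp; c &lt; d &#38; e"
```

Other named entities such as `&quot;` are not in the lookahead list, so they would be escaped again; extend the alternation if that matters.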

Lookbehind Assertions

Lookbehind assertions check if a pattern matches behind without consuming characters:

Assertion Description
(?<=...) Positive lookbehind: succeeds if pattern matches behind
(?<!...) Negative lookbehind: succeeds if pattern does NOT match behind

Examples:

-- Match "a" only if preceded by "bc" or "de"
pattern = regex.new('(?<=bc|de)a')
-- Matches: the "a" in "bca" and the "a" in "dea" (the lookbehind text is not part of the match)
-- Does NOT match: "xa"

-- Match "a" only if NOT preceded by "bc" or "de"
pattern = regex.new('(?<!bc|de)a')
-- Matches: the "a" in "xa"
-- Does NOT match: "bca", "dea"

Assertion Combinations

Assertions can be combined for complex matching:

-- Match words of 3 to 6 word characters that contain at least one vowel
pattern = regex.new('\\b(?=\\w*[aeiou])\\w{3,6}\\b', regex.ICASE)

-- Match integer strings that are not part of larger numbers
pattern = regex.new('(?<!\\d)\\d+(?!\\d)')
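
To see what the combined assertions select, the sketch below collects every match using JavaScript's RegExp (an assumed stand-in for Kōtuku's engine). `collect` is a hypothetical helper, and the second pattern is a variant of the integer example that uses `\d{2}` so the effect of the surrounding assertions is visible.

```typescript
// Hypothetical helper: gathers all matches of a global pattern.
function collect(re: RegExp, text: string): string[] {
  const out: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    out.push(m[0]);
    if (m[0].length === 0) re.lastIndex++; // guard against empty matches
  }
  return out;
}

// Words of 3 to 6 word characters with at least one vowel (case-insensitive).
const vowelWords = collect(
  new RegExp("\\b(?=\\w*[aeiou])\\w{3,6}\\b", "gi"),
  "sky rhythm apple tree by bananas Echo"
); // ["apple", "tree", "Echo"]

// Exactly two digits that are not part of a longer run of digits.
const twoDigit = collect(
  new RegExp("(?<!\\d)\\d{2}(?!\\d)", "g"),
  "7 42 1234 56"
); // ["42", "56"]
```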

Unicode Support

Kōtuku's regex implementation provides full Unicode support with UTF-8 encoding enabled by default.

Unicode Properties

Unicode properties match characters based on their Unicode characteristics using \p{...} and \P{...}:

Pattern Description
\p{Property} Matches characters with the specified Unicode property
\P{Property} Matches characters without the specified Unicode property

Common Unicode Properties

Script Properties

Match characters from specific writing systems:

-- Match Latin characters
pattern = regex.new('\\p{sc=Latin}+')

-- Match Greek characters
pattern = regex.new('\\p{Script=Greek}+')

-- Match characters whose Script_Extensions (scx) include Latin
pattern = regex.new('\\p{scx=Latin}+')

Common scripts: Latin, Greek, Cyrillic, Han, Arabic, Hebrew, Hiragana, Katakana, etc.

General Categories

Match characters by their general category:

Property  Description                     Examples
\p{Lu}    Uppercase letter                A, B, Z, À, Ω
\p{Ll}    Lowercase letter                a, b, z, à, ω
\p{Lt}    Titlecase letter                Dž, Lj, Nj
\p{L}     Any letter (Lu|Ll|Lt|Lm|Lo)     All letters
\p{Nd}    Decimal number                  0-9, ০-৯
\p{N}     Any number (Nd|Nl|No)           All numbers
\p{P}     Punctuation                     ., !, ?, ;
\p{S}     Symbol                          $, +, =, ©
\p{Z}     Separator                       Space, non-breaking space
\p{C}     Other (control, format, etc.)   Control characters

Examples:

-- Match any letter in any script
pattern = regex.new('\\p{L}+')

-- Match digits in any script
pattern = regex.new('\\p{Nd}+')

-- Match all punctuation
pattern = regex.new('\\p{P}+')

Binary Properties

Binary properties have true/false values:

-- Match whitespace characters
pattern = regex.new('\\p{White_Space}+')

-- Match emoji
pattern = regex.new('\\p{Emoji}')

-- Match characters used in identifiers
pattern = regex.new('\\p{ID_Start}\\p{ID_Continue}*')

Unicode Property Syntax

Properties can be specified in several formats:

-- Short form
\p{Lu}              -- Uppercase letter
\p{sc=Latin}        -- Latin script

-- Long form
\p{Script=Latin}
\p{General_Category=Uppercase_Letter}

-- Binary properties
\p{Emoji}
\p{White_Space}

For a complete list of available properties, see the ECMAScript Unicode Property Table.
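
The short and long forms can be tried out in JavaScript's RegExp, which accepts the same ECMAScript property names. This is a sketch under the assumption that Kōtuku resolves these names identically; JavaScript needs the `u` flag for `\p{...}`, whereas this manual describes Unicode support as enabled by default in Kōtuku.

```typescript
// Short form and long form of the same general category.
const upperShort = new RegExp("^\\p{Lu}+$", "u");
const upperLong = new RegExp("^\\p{General_Category=Uppercase_Letter}+$", "u");

// Script property and decimal digits from any script.
const greek = new RegExp("^\\p{Script=Greek}+$", "u");
const digits = new RegExp("^\\p{Nd}+$", "u");

const isUpper = upperShort.test("\u00C0\u03A9Z");    // À Ω Z: true
const sameInLongForm = upperLong.test("\u00C0\u03A9Z"); // true
const isGreek = greek.test("\u03B1\u03B2\u03B3");    // α β γ: true
const mixedDigits = digits.test("42\u09E9");         // ASCII and Bengali digits: true
```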

String Properties

Some Unicode properties match sequences of multiple characters (string properties). These can be used in character classes except negated classes:

-- Valid: string property in positive class
pattern = regex.new('[\\p{RGI_Emoji}]')

-- INVALID: string property with negation
pattern = regex.new('[^\\p{RGI_Emoji}]')  -- Throws error_complement

-- INVALID: string property with \P{...}
pattern = regex.new('\\P{RGI_Emoji}')     -- Throws error_complement

Unicode Case Folding

When case-insensitive matching is enabled with the icase flag, Unicode case folding rules apply:

pattern = regex.new('café', regex.ICASE)
-- Matches: "café", "CAFÉ", "Café", "cAfÉ", etc.

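ECMAScript specifies simple case folding, which covers more characters than ASCII uppercasing and lowercasing but always maps one code point to one code point:

pattern = regex.new('ß', regex.ICASE)
-- Matches: "ß" and "ẞ" (U+1E9E, capital sharp S)
-- Does NOT match: "SS" or "ss" (that mapping belongs to full case folding, which ECMAScript does not apply)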

Unicode Code Point Ranges

Character classes operate on Unicode code points:

-- Match all characters in Basic Multilingual Plane
pattern = regex.new('[\\u0000-\\uFFFF]+')

-- Match emoji range (partial)
pattern = regex.new('[\\u{1F600}-\\u{1F64F}]+')

Invalid UTF-8 Handling

The regex engine validates UTF-8 sequences:

  1. Trailing bytes must be in range 0x80-0xBF. Invalid trailing bytes cause matching to fail at that position.

  2. Code points must be ≤ 0x10FFFF. Values exceeding this cause matching to fail.

  3. Non-shortest forms are rejected. For example, U+0030 (digit '0') must be encoded as 0x30, not as the longer forms 0xC0 0xB0 or 0xE0 0x80 0xB0.

At pattern compile time, invalid UTF-8 throws error_utf8. At matching time, invalid UTF-8 leads to match failure at that position.
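
The three rules above can be expressed as a small validator. This is an illustrative sketch only, not Kōtuku's implementation: `isValidUtf8` is a hypothetical function, and it checks just the listed rules (it does not, for instance, reject surrogate code points).

```typescript
// Returns true if every sequence in `bytes` passes the three checks.
function isValidUtf8(bytes: number[]): boolean {
  let i = 0;
  while (i < bytes.length) {
    const lead = bytes[i];
    let extra: number; // number of trailing bytes expected
    let min: number;   // smallest code point allowed for this length
    let cp: number;    // code point being assembled
    if (lead < 0x80) { extra = 0; min = 0; cp = lead; }
    else if (lead >= 0xC0 && lead < 0xE0) { extra = 1; min = 0x80; cp = lead & 0x1F; }
    else if (lead >= 0xE0 && lead < 0xF0) { extra = 2; min = 0x800; cp = lead & 0x0F; }
    else if (lead >= 0xF0 && lead < 0xF8) { extra = 3; min = 0x10000; cp = lead & 0x07; }
    else return false; // stray trailing byte or invalid lead byte

    if (i + extra >= bytes.length) return false; // truncated sequence
    for (let k = 1; k <= extra; k++) {
      const b = bytes[i + k];
      if (b < 0x80 || b > 0xBF) return false; // rule 1: trailing byte range
      cp = (cp << 6) | (b & 0x3F);
    }
    if (cp > 0x10FFFF) return false; // rule 2: code point limit
    if (cp < min) return false;      // rule 3: non-shortest form
    i += extra + 1;
  }
  return true;
}

const shortest = isValidUtf8([0x30]);            // true
const overlong2 = isValidUtf8([0xC0, 0xB0]);     // false
const overlong3 = isValidUtf8([0xE0, 0x80, 0xB0]); // false
```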


Compilation Flags

Compilation flags affect how a regex pattern is compiled and interpreted. These flags are specified when creating a regex object.

Flag Effect
ICASE Case-insensitive matching. Matches characters regardless of case using Unicode case-folding rules.
MULTILINE Multiline mode. The ^ and $ anchors match at line boundaries (after/before line terminators) in addition to string boundaries.
DOT_ALL Dotall (singleline) mode. The . metacharacter matches line terminators (U+000A, U+000D, U+2028, U+2029) in addition to all other characters.

Flag Usage

The exact syntax for specifying flags depends on the language binding:

Tiri:

pattern = regex.new('hello', regex.ICASE)
pattern = regex.new('.*', regex.DOT_ALL)
pattern = regex.new('^line', regex.MULTILINE | regex.ICASE)

C++:

auto greeting = kt::regex("hello", kt::regex::ICASE);
auto anything = kt::regex(".*", kt::regex::DOT_ALL);
auto line_start = kt::regex("^line", kt::regex::MULTILINE | kt::regex::ICASE);

Flag Effects

ICASE (Case-Insensitive)

Makes pattern matching case-insensitive using Unicode case-folding:

pattern = regex.new('hello', regex.ICASE)
-- Matches: "hello", "HELLO", "Hello", "HeLLo", etc.

pattern = regex.new('[a-z]+', regex.ICASE)
-- Matches: "abc", "ABC", "aBc", etc.

MULTILINE

Changes behaviour of ^ and $ anchors to match line boundaries:

pattern = regex.new('^\\w+', regex.MULTILINE)
-- Without MULTILINE: matches word at start of string only
-- With MULTILINE: matches word at start of string AND after each line terminator

text = "first line\nsecond line\nthird line"
pattern = regex.new('^\\w+', regex.MULTILINE)
-- Matches: "first", "second", "third"

DOT_ALL

Makes . match line terminators in addition to all other characters:

pattern = regex.new('.*', regex.DOT_ALL)
-- Without DOT_ALL: .* matches up to (but not including) line terminators
-- With DOT_ALL: .* matches everything including line terminators

text = "line 1\nline 2\nline 3"
pattern = regex.new('.*', regex.DOT_ALL)
match = pattern.match(text)
-- match[1] = "line 1\nline 2\nline 3" (entire string)

Note: When DOT_ALL is set, .* will match all remaining characters in the subject string.
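
The effect of MULTILINE and DOT_ALL can be compared with and without each flag. The sketch below uses JavaScript's `m` and `s` flags as assumed equivalents; `allMatches` and `wholeMatch` are hypothetical helpers.

```typescript
// Hypothetical helper: gathers all matches of a global pattern.
function allMatches(re: RegExp, subject: string): string[] {
  const out: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(subject)) !== null) {
    out.push(m[0]);
    if (m[0].length === 0) re.lastIndex++; // guard against empty matches
  }
  return out;
}

// Hypothetical helper: full text of the first match, or null.
function wholeMatch(re: RegExp, subject: string): string | null {
  const m = re.exec(subject);
  return m === null ? null : m[0];
}

const text = "first line\nsecond line\nthird line";

// ^ with and without multiline mode.
const lineStarts = allMatches(new RegExp("^\\w+", "gm"), text); // first, second, third
const stringStart = allMatches(new RegExp("^\\w+", "g"), text); // first

// . with and without dotall mode.
const dotDefault = wholeMatch(new RegExp(".*"), text);  // "first line"
const dotAll = wholeMatch(new RegExp(".*", "s"), text); // the entire string
```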


Match Flags

Match flags modify the behaviour of matching operations at runtime, after a pattern has been compiled. These flags are passed to matching functions (test, match, search, replace, split).

Flag Effect
NOT_BEGIN_OF_LINE Do not treat the beginning of the text as the start of a line (affects ^ in multiline mode)
NOT_END_OF_LINE Do not treat the end of the text as the end of a line (affects $ in multiline mode)
NOT_BEGIN_OF_WORD Do not treat the beginning of the text as the start of a word (affects \b)
NOT_END_OF_WORD Do not treat the end of the text as the end of a word (affects \b)
NOT_NULL Do not match empty sequences
CONTINUOUS Only match at the beginning of the text (anchored search)
PREV_AVAILABLE Indicates that the previous character position is available for lookbehind assertions
REPLACE_NO_COPY In replace operations, do not copy non-matching parts of the text
REPLACE_FIRST_ONLY In replace operations, replace only the first occurrence

Match Flag Usage

Tiri:

pattern = regex.new('\\w+')

-- Replace only first occurrence
result = pattern.replace('hello world', 'goodbye', regex.REPLACE_FIRST_ONLY)
-- result = "goodbye world"

-- Match only at beginning
match = pattern.match('hello world', regex.CONTINUOUS)
-- Succeeds (starts at beginning)

match = pattern.match('  hello', regex.CONTINUOUS)
-- Fails (does not start at beginning)

Flag Details

NOT_BEGIN_OF_LINE / NOT_END_OF_LINE

Useful when matching in the middle of a larger text:

pattern = regex.new('^hello', regex.MULTILINE)

-- Normal matching
pattern.test('hello')  -- true (at beginning)

-- With NOT_BEGIN_OF_LINE
pattern.test('hello', regex.NOT_BEGIN_OF_LINE)  -- false (not treated as line start)

NOT_NULL

Prevents matching empty strings:

pattern = regex.new('a*')

-- Normal: matches empty string
pattern.test('')  -- true

-- With NOT_NULL: rejects empty match
pattern.test('', regex.NOT_NULL)  -- false

CONTINUOUS

Forces match to start at the beginning of the text:

pattern = regex.new('\\d+')

-- Normal: finds "123" anywhere
pattern.match('  123')  -- Matches "123"

-- With CONTINUOUS: must start at position 0
pattern.match('  123', regex.CONTINUOUS)  -- Fails
pattern.match('123', regex.CONTINUOUS)     -- Succeeds

REPLACE_NO_COPY

Affects replace operations by excluding non-matching text:

pattern = regex.new('\\d+')

-- Normal replace: keeps non-matching text
pattern.replace('a123b456c', 'X')  -- "aXbXc"

-- With REPLACE_NO_COPY: only includes replacements
pattern.replace('a123b456c', 'X', regex.REPLACE_NO_COPY)  -- "XX"

REPLACE_FIRST_ONLY

Limits replacement to the first match:

pattern = regex.new('\\d+')

-- Normal replace: replaces all
pattern.replace('123 456 789', 'X')  -- "X X X"

-- With REPLACE_FIRST_ONLY: replaces only first
pattern.replace('123 456 789', 'X', regex.REPLACE_FIRST_ONLY)  -- "X 456 789"

Regular Expression Features

Kōtuku's regex implementation is based on the ECMAScript specification and provides the following characteristics:

ECMAScript Compliance

The implementation supports expressions defined in the ECMAScript Specification (latest draft), including:

  • ECMAScript 2018 (ES9): Named capture groups, lookbehind assertions, Unicode property escapes
  • ECMAScript 2025: Duplicate named capture groups, bounded flag modifiers
  • Set operations for character classes (intersection, subtraction, string sequences)

Differences from Other Engines

vs. Perl / PCRE

  • No \Q...\E literal sequences: Use explicit escaping instead
  • No possessive quantifiers: Emulate them with a lookahead and a backreference, e.g. (?=(a+))\1 (see Substituting Advanced Features)
  • No recursive patterns: Not supported in ECMAScript
  • No conditional patterns: Use alternation with lookahead instead
  • Different Unicode categories: Follow ECMAScript Unicode property names
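
Without `\Q...\E`, literal text has to be escaped before it is embedded in a pattern. The sketch below shows one way to do that, escaping the characters listed in the Character Matching table plus the backslash. `escapeLiteral` is a hypothetical helper, not a Kōtuku function, and it targets text placed outside character classes.

```typescript
// Prefixes every pattern metacharacter with a backslash.
function escapeLiteral(text: string): string {
  return text.replace(/[\\^$.*+?()[\]{}|\/]/g, "\\$&");
}

const escapedSum = escapeLiteral("1+1"); // "1\\+1", i.e. the pattern 1\+1

// The escaped text matches itself and nothing else.
const literal = "a.b*(c)[d]{e}|f/g\\h^$?+";
const exact = new RegExp("^" + escapeLiteral(literal) + "$");
const matchesItself = exact.test(literal); // true
```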

vs. .NET Regex

  • No balanced groups: Named captures cannot be reused except in alternatives
  • No inline comments: (?#...) is not supported
  • Different flag syntax: Uses ECMAScript (?ims:...) instead of (?imnsx-imnsx:...)

vs. POSIX

  • No POSIX character classes: Use Unicode properties instead (e.g., \p{Alpha} instead of [[:alpha:]])
  • No collating sequences: [.ch.] not supported
  • No equivalence classes: [=e=] not supported

Notable Behaviours

Forward Backreferences

Backreferences can appear before their corresponding groups:

pattern = regex.new('\\1(abc)')  -- Valid

This is valid in ECMAScript but may fail or behave differently in other engines.

Undefined Group Matching

Backreferences to groups that haven't captured anything match the empty string:

pattern = regex.new('(a)?b\\1')
-- Matches "b" (group 1 did not participate, so \1 matches the empty string)

No Octal Escapes

The ECMAScript specification does not define octal escape sequences like \ooo or \0ooo (except \0 for NULL):

-- Valid
pattern = regex.new('\\0')     -- Matches NULL (U+0000)

-- Invalid (not defined by ECMAScript)
pattern = regex.new('\\101')   -- Error: invalid escape

Use hexadecimal or Unicode escapes instead:

pattern = regex.new('\\x41')    -- 'A' in hexadecimal
pattern = regex.new('\\u0041')  -- 'A' in Unicode

Substituting Advanced Features

Some operations not directly supported can be achieved through alternative patterns:

Intersection (Alternative Method):

-- Direct: [\p{sc=Latin}&&\p{Ll}]
-- Alternative: using lookahead
(?=\p{sc=Latin})\p{Ll}

Subtraction (Alternative Method):

-- Direct: [\p{sc=Latin}--\p{Ll}]
-- Alternative: using negative lookahead
(?!\p{Ll})\p{sc=Latin}
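
Both lookahead forms can be checked in an engine that lacks the set operators. A minimal sketch with JavaScript's RegExp in `u` mode, assuming the property names resolve as they do in Kōtuku:

```typescript
// Intersection emulation: a Latin character that is also lowercase.
const latinLower = new RegExp("^(?:(?=\\p{sc=Latin})\\p{Ll})+$", "u");

// Subtraction emulation: a Latin character that is not lowercase.
const latinNotLower = new RegExp("^(?:(?!\\p{Ll})\\p{sc=Latin})+$", "u");

const lowerOk = latinLower.test("a\u00F1");     // a ñ: true
const greekLower = latinLower.test("\u03B1");   // α is lowercase but not Latin: false
const upperOk = latinNotLower.test("AZ");       // true
const lowerRejected = latinNotLower.test("a");  // false
```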

Atomic Groups:

-- Perl/PCRE: (?>pattern)
-- ECMAScript equivalent: (?=(pattern))\1
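
The emulation works because an ECMAScript lookahead is never re-entered once it has succeeded, so the captured text cannot be shortened by backtracking. The sketch below contrasts it with an ordinary quantifier, using JavaScript's RegExp as an assumed stand-in:

```typescript
// Ordinary quantifier: a+ gives back one "a" so the final "a" can match.
const backtracking = new RegExp("a+a").test("aaa"); // true

// Atomic emulation: the lookahead captures every "a", \1 consumes them all,
// and nothing is given back, so the final "a" can never match.
const atomicLike = new RegExp("(?=(a+))\\1a").test("aaa"); // false

// When the following element does not compete for the same characters,
// the atomic form matches as usual.
const atomicWithB = new RegExp("(?=(a+))\\1b").test("aaab"); // true
```

The capture used by the emulation occupies a group number, so later backreferences in the same pattern need to account for it.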

Performance Considerations

Compile Once, Use Many Times

Regex patterns should be compiled once and reused:

Inefficient:

for i = 1, 10000 do
   pattern = regex.new('\\d+')  -- Compiles pattern 10,000 times
   pattern.test(data[i])
end

Efficient:

pattern = regex.new('\\d+')  -- Compiles pattern once
for i = 1, 10000 do
   pattern.test(data[i])  -- Reuses compiled pattern
end

Store Patterns in Deferred Expressions

Store frequently used patterns in variables (local or global) using deferred expressions rather than recreating them:

-- Compiled patterns
emailPattern = <{ regex.new('[\\w._%+-]+@[\\w.-]+\\.[A-Za-z]{2,}') }>
phonePattern = <{ regex.new('\\d{3}-\\d{3}-\\d{4}') }>
datePattern = <{ regex.new('\\d{4}-\\d{2}-\\d{2}') }>

-- Use patterns multiple times efficiently
for contact in values(contacts) do
   if emailPattern.test(contact.email) then
      processEmail(contact)
   end
   if phonePattern.test(contact.phone) then
      processPhone(contact)
   end
end
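The compile-once idea can also be expressed as a lazy cache. This is a sketch in TypeScript, not a description of how Tiri's deferred expressions work; `cachedPattern`, `compiledPatterns`, and `compileCount` are hypothetical names introduced for illustration.

```typescript
// Compiles each pattern source at most once and returns the cached
// object on later calls.
const compiledPatterns = new Map<string, RegExp>();
let compileCount = 0;

function cachedPattern(source: string): RegExp {
  let compiled = compiledPatterns.get(source);
  if (compiled === undefined) {
    compiled = new RegExp(source);
    compiledPatterns.set(source, compiled);
    compileCount += 1;
  }
  return compiled;
}

const phoneSource = '\\d{3}-\\d{3}-\\d{4}';
const samplePhones = ['555-123-4567', 'not a phone', '555-987-6543'];
// The pattern is requested three times but compiled only on the first call.
const validPhones = samplePhones.filter((p) => cachedPattern(phoneSource).test(p));

console.log(validPhones, compileCount);
```

Compilation is deferred until the first use, so a pattern that is never needed is never compiled.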

Greedy vs Non-Greedy Quantifiers

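Greedy and non-greedy quantifiers match different text, and the choice can also affect how much work the engine does:

-- Greedy: consumes as much as possible, then backtracks until the rest of the pattern matches
pattern = regex.new('<.*>')
-- Matches: "<tag>content</tag>" as one match (backtracks from the end of the line)

-- Non-greedy: consumes as little as possible, extending one character at a time
pattern = regex.new('<.*?>')
-- Matches: "<tag>" and "</tag>" separately

When the closing delimiter sits close to the opening one, as is common with HTML/XML tags, the non-greedy form usually does less work: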

-- Extract tag content efficiently
pattern = regex.new('<([^>]+)>(.*?)</\\1>')
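The difference in matched text is easy to confirm. This sketch uses TypeScript's built-in `RegExp` as a stand-in ECMAScript engine (an assumption; Kōtuku's result objects may differ) and applies the three patterns above to one sample string.

```typescript
const sampleMarkup = '<tag>content</tag>';

// Greedy: a single match spanning the whole string.
const greedyMatch = sampleMarkup.match(new RegExp('<.*>'));
// Non-greedy with the global flag: each tag is matched separately.
const lazyMatches = sampleMarkup.match(new RegExp('<.*?>', 'g'));

// The backreference \1 ties the closing tag to the opening tag name.
const taggedContent = new RegExp('<([^>]+)>(.*?)</\\1>').exec(sampleMarkup);

console.log(greedyMatch, lazyMatches, taggedContent);
```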

Avoid Catastrophic Backtracking

Certain patterns can cause exponential time complexity:

Dangerous Pattern:

-- Exponential backtracking on non-match
pattern = regex.new('(a+)+b')
text = 'aaaaaaaaaaaaaaaaaac'  -- No 'b' at end
-- Matching time grows exponentially with the number of 'a' characters in the input

Solutions:

  1. Use possessive-like behaviour:

    -- Emulate (?>a+)+b: the lookahead captures the whole run of a's and \1 consumes it without giving any back
    pattern = regex.new('(?:(?=(a+))\\1)+b')
  2. Use negated character classes:

    -- Clearer intent, better performance
    pattern = regex.new('[^b]+b')
  3. Be specific about what you're matching:

    -- Instead of: .*
    -- Use: [^<]+ (if not matching '<')
    -- Use: \\w+ (if matching word characters)
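In many cases the nested quantifier is simply redundant and can be flattened. The sketch below, written in TypeScript with the built-in `RegExp` as a stand-in engine (an assumption), checks on a few short inputs that `(a+)+b` and `a+b` accept the same strings. The helper name `sameVerdict` is hypothetical.

```typescript
// "One or more groups of one or more a" describes the same language as
// "one or more a"; only the backtracking on a failed match differs.
const nestedQuantifier = new RegExp('^(a+)+b$');
const flatQuantifier = new RegExp('^a+b$');

function sameVerdict(input: string): boolean {
  return nestedQuantifier.test(input) === flatQuantifier.test(input);
}

// Inputs are kept short on purpose: the nested form is the slow one.
const probes = ['b', 'ab', 'aab', 'aaaa', 'aaaac', 'ba', ''];
const allAgree = probes.every(sameVerdict);

console.log(allAgree);
```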

Character Class Optimisations

Use predefined classes when possible:

-- Faster
pattern = regex.new('\\d+')

-- Slower (equivalent but not optimised)
pattern = regex.new('[0-9]+')

Simplify complex classes:

-- Complex
pattern = regex.new('[A-Za-z0-9_]+')

-- Simpler and equivalent
pattern = regex.new('\\w+')

Anchoring Patterns

Anchor patterns to reduce search space:

-- Unanchored: searches entire string
pattern = regex.new('\\d+')

-- Anchored: only checks from beginning
pattern = regex.new('^\\d+')

-- Anchored both ends: exact match only
pattern = regex.new('^\\d+$')

Unicode Property Matching

Unicode properties are optimised internally, but broad categories are faster than specific scripts:

-- Faster: general category
pattern = regex.new('\\p{L}+')  -- All letters

-- Slower: specific script
pattern = regex.new('\\p{sc=Latin}+')  -- Characters in the Latin script only

Best Practices Summary

  1. Compile patterns once, reuse many times
  2. Store patterns in variables
  3. Use non-greedy quantifiers when appropriate
  4. Anchor patterns when possible (^, $)
  5. Avoid nested quantifiers that can cause exponential backtracking
  6. Use predefined character classes (\d, \w, \s)
  7. Be specific in patterns to reduce backtracking
  8. Test performance with realistic data

Common Patterns and Examples

This section provides practical regex patterns for common use cases.

Email Validation

Basic email pattern:

pattern = regex.new('[\\w._%+-]+@[\\w.-]+\\.[A-Za-z]{2,}')
-- Matches: user@example.com, first.last@sub.domain.co.uk

Explanation:

  • [\w._%+-]+ - Username: word characters, dots, underscores, percent, plus, hyphen
  • @ - Literal @ symbol
  • [\w.-]+ - Domain name: word characters, dots, hyphens
  • \. - Literal dot
  • [A-Za-z]{2,} - Top-level domain: 2 or more letters

More strict pattern:

pattern = regex.new('^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$')
-- Anchored to match entire string

URL Matching

Basic URL pattern:

pattern = regex.new('(https?)://([^/\\s]+)([^\\s]*)')
-- Captures: protocol, domain, path

match = pattern.match('https://example.com/path?query=value')
-- match[1] = "https://example.com/path?query=value" (full match)
-- match[2] = "https" (protocol)
-- match[3] = "example.com" (domain)
-- match[4] = "/path?query=value" (path)

With named captures:

pattern = regex.new('(?<protocol>https?)://(?<domain>[^/\\s]+)(?<path>[^\\s]*)')

match = pattern.match('https://example.com/path')
-- Access by name: match.domain (language binding dependent)
-- Access by number: match[3]
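For comparison, here is the same URL pattern in a standard ECMAScript engine, sketched in TypeScript. This is not Kōtuku's API: JavaScript stores the full match at index 0 and the captures from index 1, whereas the Tiri examples above show the full match at index 1. `UrlParts` and `splitUrl` are hypothetical names.

```typescript
const urlPattern = new RegExp('(https?)://([^/\\s]+)([^\\s]*)');

interface UrlParts {
  protocol: string;
  domain: string;
  path: string;
}

// Returns the three captured parts, or null when no URL is found.
function splitUrl(text: string): UrlParts | null {
  const m = urlPattern.exec(text);
  if (m === null) return null;
  // m[0] is the full match; captures start at m[1] in JavaScript.
  return { protocol: m[1], domain: m[2], path: m[3] };
}

console.log(splitUrl('https://example.com/path?query=value'));
```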

Phone Numbers

US phone number:

-- Format: 555-123-4567
pattern = regex.new('\\d{3}-\\d{3}-\\d{4}')

-- With optional country code: +1-555-123-4567
pattern = regex.new('(\\+1-)?\\d{3}-\\d{3}-\\d{4}')

-- With optional separators (-, ., space, or none)
pattern = regex.new('\\d{3}[-. ]?\\d{3}[-. ]?\\d{4}')

International E.164 format:

-- Matches: +1234567890, +123456789012345 (first digit is non-zero; at most 15 digits in total)
pattern = regex.new('\\+[1-9]\\d{1,14}')

Date Matching

ISO 8601 date (YYYY-MM-DD):

pattern = regex.new('\\d{4}-\\d{2}-\\d{2}')
-- Matches: 2025-10-14

-- With validation (basic):
pattern = regex.new('\\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])')
-- Validates month (01-12) and day (01-31)
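The validating pattern checks digit ranges only, so impossible dates such as 2025-02-31 still pass. A minimal sketch of a follow-up calendar check, written in TypeScript with the built-in `RegExp` and `Date` as stand-ins (assumptions; `isRealIsoDate` is a hypothetical helper and the pattern is anchored here for whole-string validation):

```typescript
const isoDate = new RegExp('^(\\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])$');

function isRealIsoDate(text: string): boolean {
  const m = isoDate.exec(text);
  if (m === null) return false;
  const year = Number(m[1]);
  const month = Number(m[2]);
  const day = Number(m[3]);
  // Day 0 of the following month is the last day of this month.
  // Note: Date.UTC remaps years 0 to 99 to 1900 to 1999; ignored in this sketch.
  const daysInMonth = new Date(Date.UTC(year, month, 0)).getUTCDate();
  return day <= daysInMonth;
}

console.log(isoDate.test('2025-02-31'), isRealIsoDate('2025-02-31'));
```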

US date format (MM/DD/YYYY):

pattern = regex.new('(0[1-9]|1[0-2])/(0[1-9]|[12]\\d|3[01])/\\d{4}')
-- Matches: 10/14/2025

Flexible date format:

pattern = regex.new('\\d{1,2}[-/]\\d{1,2}[-/]\\d{2,4}')
-- Matches: 10/14/2025, 10-14-25, 1/5/2025

Time Matching

24-hour time (HH:MM):

pattern = regex.new('([01]?\\d|2[0-3]):[0-5]\\d')
-- Matches: 09:30, 23:59, 8:05

-- With optional seconds:
pattern = regex.new('([01]?\\d|2[0-3]):[0-5]\\d(:[0-5]\\d)?')
-- Matches: 09:30, 09:30:45

12-hour time with AM/PM:

pattern = regex.new('(0?[1-9]|1[0-2]):[0-5]\\d\\s*([AaPp][Mm])')
-- Matches: 9:30 AM, 12:45 PM, 9:30AM

Password Validation

Minimum requirements (8+ chars, 1 uppercase, 1 lowercase, 1 digit):

pattern = regex.new('^(?=.*[a-z])(?=.*[A-Z])(?=.*\\d).{8,}$')

Explanation:

  • ^ - Start of string
  • (?=.*[a-z]) - Lookahead: at least one lowercase
  • (?=.*[A-Z]) - Lookahead: at least one uppercase
  • (?=.*\d) - Lookahead: at least one digit
  • .{8,} - At least 8 characters
  • $ - End of string

With special character requirement:

pattern = regex.new('^(?=.*[a-z])(?=.*[A-Z])(?=.*\\d)(?=.*[@$!%*?&]).{8,}$')
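A single lookahead pattern answers only yes or no. When the caller needs to know which requirement failed, the rules can be tested separately. The sketch below is in TypeScript with the built-in `RegExp` as a stand-in engine (an assumption); `passwordRules` and `failedRules` are hypothetical names.

```typescript
const combinedPolicy = new RegExp('^(?=.*[a-z])(?=.*[A-Z])(?=.*\\d).{8,}$');

// One rule per requirement, so a failure can be reported by name.
const passwordRules: Array<[string, RegExp]> = [
  ['lowercase letter', new RegExp('[a-z]')],
  ['uppercase letter', new RegExp('[A-Z]')],
  ['digit', new RegExp('\\d')],
  ['length of 8 or more', new RegExp('^.{8,}$')],
];

// Returns the labels of every rule the candidate does not satisfy.
function failedRules(candidate: string): string[] {
  return passwordRules
    .filter(([, rule]) => !rule.test(candidate))
    .map(([label]) => label);
}

console.log(combinedPolicy.test('password1'), failedRules('password1'));
```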

IP Address Matching

IPv4 address:

pattern = regex.new('\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b')
-- Matches: 192.168.1.1, 10.0.0.1

-- With validation (0-255 per octet):
pattern = regex.new('\\b(?:(?:25[0-5]|2[0-4]\\d|[01]?\\d\\d?)\\.){3}(?:25[0-5]|2[0-4]\\d|[01]?\\d\\d?)\\b')
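The effect of the per-octet validation shows up on out-of-range input. A short check in TypeScript, using the built-in `RegExp` as a stand-in ECMAScript engine (an assumption):

```typescript
const looseIpv4 = new RegExp('\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b');
const strictIpv4 = new RegExp(
  '\\b(?:(?:25[0-5]|2[0-4]\\d|[01]?\\d\\d?)\\.){3}(?:25[0-5]|2[0-4]\\d|[01]?\\d\\d?)\\b'
);

// The loose form accepts any one to three digits per octet.
console.log(looseIpv4.test('999.1.1.1'));
// The strict form rejects octets above 255.
console.log(strictIpv4.test('999.1.1.1'));
console.log(strictIpv4.test('192.168.1.1'));
```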

IPv6 address (simplified):

pattern = regex.new('(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}')
-- Matches full IPv6: 2001:0db8:85a3:0000:0000:8a2e:0370:7334

HTML/XML Tag Matching

Match opening and closing tags:

pattern = regex.new('<([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*>(.*?)</\\1>')
-- Matches: <div>content</div>, <span class="x">text</span>
-- Captures: tag name (group 1), content (group 2)

Extract tag content:

pattern = regex.new('<[^>]+>(.*?)</[^>]+>')
-- Captures content between any tags

Match self-closing tags:

pattern = regex.new('<[a-zA-Z][a-zA-Z0-9]*\\b[^>]*/>')
-- Matches: <br/>, <img src="x" />

CSV Parsing

Basic CSV field:

pattern = regex.new('([^,]+),?')
-- Matches fields separated by commas

CSV with quoted fields:

pattern = regex.new('(?:^|,)(?:\"([^\"]*(?:\"\"[^\"]*)*)\"|([^,]*))')
-- Handles: "quoted field", unquoted, "field with ""quotes"""

Word Extraction

Extract words:

pattern = regex.new('\\b\\w+\\b')
-- Matches: any word (alphanumeric + underscore)

pattern = regex.new('\\b[A-Za-z]+\\b')
-- Matches: only alphabetic words

Extract words with apostrophes:

pattern = regex.new('\\b[A-Za-z]+(?:\'[A-Za-z]+)?\\b')
-- Matches: don't, it's, can't, etc.

Number Extraction

Integer:

pattern = regex.new('-?\\d+')
-- Matches: 123, -456

Floating point:

pattern = regex.new('-?\\d+\\.\\d+')
-- Matches: 123.45, -67.89

-- With optional decimal part:
pattern = regex.new('-?\\d+(?:\\.\\d+)?')
-- Matches: 123, 123.45, -67.89

Scientific notation:

pattern = regex.new('-?\\d+(?:\\.\\d+)?(?:[eE][+-]?\\d+)?')
-- Matches: 1.23e10, -4.5E-6, 123

Whitespace Handling

Trim leading/trailing whitespace:

pattern = regex.new('^\\s+|\\s+$')
-- Use with replace to remove leading/trailing spaces

Collapse multiple spaces:

pattern = regex.new('\\s+')
-- Replace with a single space to normalise whitespace

Split on whitespace:

pattern = regex.new('\\s+')
-- Use with split to separate words
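The three operations can be chained. This sketch uses JavaScript's `String.prototype.replace` and `split` through TypeScript as stand-ins for the corresponding Kōtuku methods (an assumption; see the Regex module API documentation for the actual method names).

```typescript
const untidy = '  alpha   beta\tgamma  ';

// Trim: remove leading and trailing whitespace (global flag hits both ends).
const trimmed = untidy.replace(new RegExp('^\\s+|\\s+$', 'g'), '');
// Collapse: every run of whitespace becomes one space.
const collapsed = trimmed.replace(new RegExp('\\s+', 'g'), ' ');
// Split: separate the words on any run of whitespace.
const words = trimmed.split(new RegExp('\\s+'));

console.log(trimmed, collapsed, words);
```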

File Path Matching

Unix/Linux path:

pattern = regex.new('^(/[^/]+)+/?$')
-- Matches: /home/user/file.txt, /usr/local/bin/

Windows path:

pattern = regex.new('^[A-Za-z]:\\\\(?:[^\\\\/:*?\"<>|]+\\\\)*[^\\\\/:*?\"<>|]*$')
-- Matches: C:\Users\Name\file.txt

File extension:

pattern = regex.new('\\.([A-Za-z0-9]+)$')
-- Captures file extension: .txt, .pdf, .jpg

Version Number Matching

Semantic versioning:

pattern = regex.new('^(0|[1-9]\\d*)\\.(0|[1-9]\\d*)\\.(0|[1-9]\\d*)(?:-((?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\\.(?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\\+([0-9a-zA-Z-]+(?:\\.[0-9a-zA-Z-]+)*))?$')
-- Matches: 1.0.0, 2.1.3, 1.0.0-alpha.1, 1.0.0+build.123

Simple version:

pattern = regex.new('\\d+\\.\\d+(?:\\.\\d+)?')
-- Matches: 1.0, 1.0.5, 2.10.1

Error Handling

When pattern compilation or matching fails, specific error types indicate the nature of the problem. Understanding these errors helps diagnose and fix pattern issues.

Compilation Errors

These errors occur when compiling a regex pattern:

| Error | Description | Example |
| --- | --- | --- |
| error_escape | Invalid escape sequence | `\q` (undefined escape), `\c` (not followed by letter), `\x` (not followed by two hex digits), `\u{GGGG}` (invalid hex) |
| error_brack | Mismatched square brackets | `[abc`, `abc]`, `[a[b]` (nested) |
| error_paren | Mismatched parentheses | `(abc`, `abc)`, `((a)` (unclosed) |
| error_brace | Mismatched curly braces | `a{3`, `a3}`, `a{2,` (missing closing brace) |
| error_badbrace | Invalid quantifier range | `{3,2}` (n > m), `{-1}` (negative), `{,5}` (missing n) |
| error_range | Invalid character range in class | `[z-a]` (reversed), `[\u0100-\u0010]` (start > end) |
| error_backref | Invalid backreference | `\9` (group doesn't exist), `\k<name>` (name doesn't exist) |
| error_modifier | Invalid flag modifier | `(?ii:...)` (duplicate flag), `(?i-i:...)` (contradictory) |
| error_operator | Invalid set operator usage | `[AB--CD]` (mixed operators at same level), `!!` (reserved double punctuator in class) |
| error_noescape | Character must be escaped | `[(]` (should be `[\(]`), `[{]` (should be `[\{]`) in character classes |
| error_complement | Invalid negation | `[^\p{RGI_Emoji}]` (string property in negated class), `\P{RGI_Emoji}` (string property with `\P`) |
| error_badrepeat | Quantifier without preceding expression | `*abc` (starts with quantifier), `a**` (double quantifier) |
| error_utf8 | Invalid UTF-8 sequence in pattern | Pattern contains invalid UTF-8 bytes, overlong encoding, or code point > U+10FFFF |

Error Examples

error_escape

-- Invalid: \q is not defined
pattern = regex.new('\\q')  -- Error: invalid escape sequence

-- Invalid: \c not followed by letter
pattern = regex.new('\\c5')  -- Error: expected A-Z or a-z after \c

-- Invalid: \x not followed by two hex digits
pattern = regex.new('\\xGG')  -- Error: expected two hex digits

-- Invalid: code point exceeds maximum
pattern = regex.new('\\u{110000}')  -- Error: code point > U+10FFFF

-- Valid alternatives:
pattern = regex.new('q')           -- Literal q
pattern = regex.new('\\x71')       -- Hex escape for q
pattern = regex.new('\\u0071')     -- Unicode escape for q

error_brack

-- Invalid: unclosed bracket
pattern = regex.new('[abc')  -- Error: missing ]

-- Invalid: extra closing bracket
pattern = regex.new('abc]')  -- Error: unmatched ]

-- Valid:
pattern = regex.new('[abc]')       -- Correct bracket pair
pattern = regex.new('\\]')         -- Escaped bracket (literal)

error_paren

-- Invalid: unclosed parenthesis
pattern = regex.new('(abc')  -- Error: missing )

-- Invalid: extra closing parenthesis
pattern = regex.new('abc)')  -- Error: unmatched )

-- Valid:
pattern = regex.new('(abc)')       -- Correct parenthesis pair
pattern = regex.new('\\(abc\\)')   -- Escaped parentheses (literals)

error_brace

-- Invalid: unclosed brace
pattern = regex.new('a{3')  -- Error: missing }

-- Valid:
pattern = regex.new('a{3}')        -- Correct quantifier
pattern = regex.new('\\{3\\}')     -- Escaped braces (literals)

error_badbrace

-- Invalid: n > m in range
pattern = regex.new('a{5,3}')  -- Error: 5 > 3

-- Invalid: missing n
pattern = regex.new('a{,5}')  -- Error: must specify n

-- Valid:
pattern = regex.new('a{3,5}')      -- n ≤ m
pattern = regex.new('a{3,}')       -- n or more (no maximum)
pattern = regex.new('a{3}')        -- exactly n

error_range

-- Invalid: reversed range
pattern = regex.new('[z-a]')  -- Error: z (U+007A) > a (U+0061)

-- Invalid: reversed range written with escapes
pattern = regex.new('[\\u0100-\\u0010]')  -- Error: start > end

-- Valid:
pattern = regex.new('[a-z]')       -- Correct range
pattern = regex.new('[z]')         -- Single character (no range)

error_backref

-- Invalid: group doesn't exist
pattern = regex.new('\\5')  -- Error: no group 5

-- Invalid: named group doesn't exist
pattern = regex.new('\\k<missing>')  -- Error: no group named 'missing'

-- Valid:
pattern = regex.new('(a)\\1')            -- Backreference to group 1
pattern = regex.new('(?<x>a)\\k<x>')    -- Named backreference

error_modifier

-- Invalid: duplicate flag
pattern = regex.new('(?ii:abc)')  -- Error: 'i' appears twice

-- Invalid: contradictory flags
pattern = regex.new('(?i-i:abc)')  -- Error: both +i and -i

-- Valid:
pattern = regex.new('(?i:abc)')          -- Single flag
pattern = regex.new('(?im:abc)')         -- Multiple different flags
pattern = regex.new('(?i-m:abc)')        -- Enable and disable flags

error_operator

-- Invalid: mixed operators at same level
pattern = regex.new('[AB--CD]')  -- Error: union (AB) then subtraction

-- Invalid: reserved double punctuator
pattern = regex.new('[a-z!!]')  -- Error: !! is reserved

-- Valid:
pattern = regex.new('[[AB]--[CD]]')      -- Nested classes
pattern = regex.new('[A[B--C]D]')        -- Operator in nested level
pattern = regex.new('[a-z\\!\\!]')       -- Escaped (two separate !)

error_noescape

-- Invalid: ( must be escaped in character class
pattern = regex.new('[(]')  -- Error: must escape (

-- Invalid: { must be escaped
pattern = regex.new('[{]')  -- Error: must escape {

-- Valid:
pattern = regex.new('[\\(]')             -- Escaped (
pattern = regex.new('[\\{\\}]')          -- Escaped braces

error_complement

-- Invalid: string property in negated class
pattern = regex.new('[^\\p{RGI_Emoji}]')  -- Error: cannot negate string property

-- Invalid: string property with \P
pattern = regex.new('\\P{RGI_Emoji}')     -- Error: \P doesn't support string properties

-- Valid:
pattern = regex.new('[\\p{RGI_Emoji}]')      -- String property in positive class
pattern = regex.new('\\P{Emoji}')             -- Character property (not string)
pattern = regex.new('[^\\p{Emoji}]')          -- Character property negated

error_badrepeat

-- Invalid: quantifier at start
pattern = regex.new('*abc')  -- Error: nothing to repeat

-- Invalid: double quantifier
pattern = regex.new('a**')   -- Error: quantifier on quantifier

-- Valid:
pattern = regex.new('a*bc')              -- Quantifier after character
pattern = regex.new('\\*abc')            -- Escaped * (literal)

error_utf8

-- Invalid UTF-8 in pattern
-- (This typically occurs when pattern strings contain invalid byte sequences)

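-- Invalid: overlong encoding
-- (single backslashes: the string literal itself emits the raw bytes 0xC0 0xB0,
--  assuming string-level \x escapes; '\\xC0\\xB0' would instead be two valid regex escapes)
pattern = regex.new('\xC0\xB0')  -- Error: overlong form of U+0030

-- Valid:
pattern = regex.new('\\x30')             -- Regex hex escape for U+0030
pattern = regex.new('\\u0030')           -- Unicode escape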

Handling Errors in Code

Tiri:

try
   pattern = regex.new('[invalid')
except ex
   print('Pattern compilation failed: ' .. ex.message)
success
   -- Use pattern
end
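An alternative to handling the exception at every call site is to wrap compilation in a helper that returns the error as a value. The sketch below shows the idea in TypeScript with the built-in `RegExp`, which throws a `SyntaxError` on an invalid pattern. It is not Kōtuku's API; `CompileResult` and `tryCompile` are hypothetical names.

```typescript
interface CompileResult {
  pattern: RegExp | null;
  error: string | null;
}

// Compiles the source and reports failure as data instead of throwing.
function tryCompile(source: string): CompileResult {
  try {
    return { pattern: new RegExp(source, 'u'), error: null };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { pattern: null, error: message };
  }
}

const badResult = tryCompile('[invalid');
const goodResult = tryCompile('[valid]');
console.log(badResult.error, goodResult.pattern !== null);
```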

Debugging Tips

  1. Test patterns incrementally: Build complex patterns step by step, testing each addition
  2. Use online regex testers: Many tools visualise patterns and highlight errors (ensure they support ECMAScript syntax)
  3. Check bracket matching: Count opening and closing brackets/parentheses/braces
  4. Validate escape sequences: Ensure all \ sequences are valid
  5. Review operator precedence: Verify set operations are properly nested
  6. Examine Unicode sequences: Confirm \u{...} values are valid code points
  7. Test with edge cases: Try empty strings, very long strings, and strings with special characters

Common Mistakes

Forgetting to escape special characters:

-- Wrong: . matches any character
pattern = regex.new('file.txt')
-- Matches: "file.txt", "file?txt", "fileXtxt"

-- Correct: \. matches literal dot
pattern = regex.new('file\\.txt')
-- Matches: "file.txt" only
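When the literal text comes from user input, escaping by hand is error-prone. A common approach is a small helper that backslash-escapes the classic metacharacters. The sketch below is in TypeScript with the built-in `RegExp` as a stand-in engine; `escapeForPattern` is a hypothetical helper, not part of Kōtuku's API, and it does not cover the extra characters that must be escaped inside character classes.

```typescript
// Backslash-escapes . * + ? ^ $ { } ( ) | [ ] and \ so that the text
// is matched literally when embedded in a pattern.
function escapeForPattern(literal: string): string {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

const unescapedName = new RegExp('^file.txt$');
const escapedName = new RegExp('^' + escapeForPattern('file.txt') + '$');

console.log(unescapedName.test('fileXtxt'));  // the dot matches any character
console.log(escapedName.test('fileXtxt'));    // the dot is now literal
```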

Incorrect bracket nesting:

-- Wrong: the inner class is closed but the outer one is not
pattern = regex.new('[[a-z]')  -- Error

-- Correct: every nested class has its own closing bracket
pattern = regex.new('[[a-m][n-z]]')  -- Union of two ranges

Quantifier on quantifier:

-- Wrong: double quantifier
pattern = regex.new('a*+')  -- Error

-- Correct: quantify group
pattern = regex.new('(a*)+')

Summary

This manual has covered the complete regular expression syntax and features supported by Kōtuku:

  • Character matching including Unicode escapes and special characters
  • Character classes with ranges, predefined classes, and set operations
  • Quantifiers for controlling repetition (greedy and non-greedy)
  • Groups and backreferences for capturing and reusing matched text
  • Assertions for zero-width matching conditions
  • Unicode support with full UTF-8 and property matching
  • Flags for controlling compilation and matching behaviour
  • Performance considerations for efficient pattern usage
  • Common patterns for practical applications
  • Error handling for debugging pattern issues

For API documentation on the Regex class and its methods, please refer to the Regex module documentation in the Kōtuku API reference.


This manual documents the regex implementation as of 2026. For updates and the latest specification, refer to the ECMAScript Specification.
