Regex Manual
This manual provides comprehensive technical documentation for regular expression support in Kōtuku. It covers pattern syntax, behaviour, and all supported features for both Tiri and C++ developers.
For information on how to use the Regex class and its methods in your code, please refer to the Regex module API documentation.
- Introduction
- Character Matching
- Alternatives
- Character Classes
- Quantifiers
- Grouping and Backreferences
- Flag Modifiers
- Assertions
- Unicode Support
- Compilation Flags
- Match Flags
- Regular Expression Features
- Performance Considerations
- Common Patterns and Examples
- Error Handling
Kōtuku's regex support is based on the regular expression syntax defined in the ECMAScript Specification. The implementation provides full Unicode support with UTF-8 encoding enabled by default.
Key features include:
- ECMAScript Compliance: Supports expressions defined in the latest ECMAScript specification draft
- Unicode Support: Full UTF-8 Unicode support with property matching
- Set Operations: Character class intersection, subtraction, and string sequences
- Named Captures: Named and numbered capture groups with backreferences
- Lookahead and Lookbehind: Zero-width assertions for complex matching
- Flag Modifiers: Inline flag control for case sensitivity, multiline, and dotall modes
Regular expressions are compiled into pattern objects that can be reused for efficient matching operations.
Regular expressions match characters in the target text based on pattern specifications. The following table describes all character matching forms:
| Pattern | Description | Example |
|---|---|---|
| `.` | Matches any character except line terminators (U+000A, U+000D, U+2028, U+2029). With the dotall flag, matches every code point. | `a.c` matches "abc", "aXc" |
| `\0` | Matches NULL character (U+0000) | `\0` matches null byte |
| `\t` | Matches Horizontal Tab (U+0009) | `a\tb` matches "a", a tab, then "b" |
| `\n` | Matches Line Feed (U+000A) | `a\nb` matches "a", a line feed, then "b" |
| `\v` | Matches Vertical Tab (U+000B) | `\v` matches vertical tab |
| `\f` | Matches Form Feed (U+000C) | `\f` matches form feed |
| `\r` | Matches Carriage Return (U+000D) | `\r\n` matches Windows line ending |
| `\cX` | Matches control character where X is A-Z or a-z. Value is (code point of X) & 0x1F | `\cA` matches Ctrl-A (U+0001) |
| `\\` | Matches backslash character (U+005C) | `\\` matches a single `\` |
| `\xHH` | Matches character with hexadecimal code HH (00-FF) | `\x41` matches "A" |
| `\uHHHH` | Matches character with Unicode code point HHHH | `\u0041` matches "A" |
| `\u{H...}` | Matches character with Unicode code point represented by hex digits (up to 10FFFF) | `\u{1F600}` matches 😀 |
| `\X` | When X is one of `^ $ . * + ? ( ) [ ] { } \| /`, matches X literally | `\(` matches "(" |
| Character | Any character not listed above matches itself | `abc` matches "abc" |
Line terminator code points are: U+000A (Line Feed), U+000D (Carriage Return), U+2028 (Line Separator), and U+2029 (Paragraph Separator).
All escape sequences must be complete and valid. If \c is not followed by a letter A-Z or a-z, \x is not followed by two hexadecimal digits, \u is not followed by four hexadecimal digits, or \u{...} does not contain valid hexadecimal or exceeds U+10FFFF, an error_escape exception is thrown.
In character classes (see Character Classes), the hyphen - can also be escaped as \-. The character ] must always be escaped as \] to be matched literally.
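The escape forms above can be checked with a minimal sketch. It uses TypeScript's built-in `RegExp`, which follows the same ECMAScript grammar, as a stand-in for Kōtuku's Regex class; `controlCodePoint` and `matchesWhole` are hypothetical helper names, not part of any Kōtuku API.

```typescript
// Value matched by \cX: (code point of X) & 0x1F.
// controlCodePoint is a hypothetical helper used only for illustration.
function controlCodePoint(letter: string): number {
  return letter.charCodeAt(0) & 0x1f;
}

// True when the pattern source matches the whole of `text`.
function matchesWhole(source: string, text: string, flags: string): boolean {
  return new RegExp("^(?:" + source + ")$", flags).test(text);
}

const ctrlUpperA: number = controlCodePoint("A"); // 0x41 & 0x1F
const ctrlLowerA: number = controlCodePoint("a"); // 0x61 & 0x1F

// \x41, \u0041 and \u{41} all denote "A".
const hexIsA: boolean = matchesWhole("\\x41", "A", "");
const unicodeIsA: boolean = matchesWhole("\\u0041", "A", "");
// JavaScript needs the "u" flag for \u{...}; Kōtuku enables Unicode by default.
const bracedIsA: boolean = matchesWhole("\\u{41}", "A", "u");
const ctrlMatches: boolean = matchesWhole("\\cA", String.fromCharCode(ctrlUpperA), "");
```

Both `\cA` and `\ca` resolve to U+0001 because only the low five bits of the letter are kept.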
The | operator matches one of multiple alternative patterns, evaluated from left to right:
A|B|C
This matches pattern A, or pattern B, or pattern C. The first successful match is adopted, and remaining alternatives are not evaluated.
pattern = regex.new('abc|abcdef')
match = pattern.match('abcdef')
-- match[1] = "abc" (not "abcdef")
Even though "abcdef" would match the second alternative completely, the pattern matches "abc" from the first alternative because alternatives are evaluated left to right.
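The effect of alternative order can be reproduced with a short sketch in TypeScript, whose `RegExp` uses the same left-to-right ECMAScript semantics. `firstMatch` is a hypothetical helper; note that JavaScript stores the full match at index 0, whereas the Tiri examples in this manual use index 1.

```typescript
// Returns the text of the first match, or null when nothing matches.
function firstMatch(source: string, text: string): string | null {
  const m = new RegExp(source).exec(text);
  return m === null ? null : m[0];
}

// The shorter alternative is listed first, so it wins.
const shortFirst = firstMatch("abc|abcdef", "abcdef"); // "abc"

// Listing the longer alternative first lets it match in full.
const longFirst = firstMatch("abcdef|abc", "abcdef"); // "abcdef"
```

When one alternative is a prefix of another, placing the longer one first is a simple way to get the longer match.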
Multiple alternatives can be combined:
pattern = regex.new('cat|dog|bird|fish')
Character classes define sets of characters that can match at a single position in the target text.
A character class is enclosed in square brackets [...] and matches any single character from the set:
| Pattern | Description | Example |
|---|---|---|
| `[ABC]` | Matches any of A, B, or C | `[ABC]` matches "A", "B", or "C" |
| `[^DEF]` | Matches any character except D, E, or F (negated class) | `[^DEF]` matches any character but "D", "E", "F" |
| `[G^H]` | Matches G, ^, or H (^ not first, so literal) | `[G^H]` matches "G", "^", or "H" |
| `[I-K]` | Matches any character from I to K inclusive (range) | `[I-K]` matches "I", "J", "K" |
| `[-LM]` | Matches -, L, or M (leading hyphen is literal) | `[-LM]` matches "-", "L", "M" |
| `[N-P-R]` | Matches N, O, P, -, or R (trailing hyphen after range is literal) | `[N-P-R]` matches "N", "O", "P", "-", "R" |
| `[S\-U]` | Matches S, -, or U (escaped hyphen) | `[S\-U]` matches "S", "-", "U" |
| `[.({\|]` | Special regex characters lose their special meaning in character classes | `[.({\|]` matches ".", "(", "{", "\|" |
| `[]` | Empty class matches no code points (always fails) | `[]` never matches |
| `[^]` | Complement of empty class matches any code point | `[^]` matches any character including line terminators |
- Negation: When `^` is the first character in `[]`, the class is negated and matches any character NOT in the set.
- Closing Bracket: The `]` character must always be escaped as `\]` to be included literally in a character class.
- Hyphen: The `-` character creates a range when between two characters. To match `-` literally, place it first, last, or escape it as `\-`.
- Special Characters: Most special regex characters (`.`, `*`, `+`, etc.) lose their special meaning inside character classes.
Ranges define a span of consecutive Unicode code points:
pattern = regex.new('[A-Z]') -- Matches any uppercase letter A-Z
pattern = regex.new('[0-9]') -- Matches any digit
pattern = regex.new('[a-zA-Z]') -- Matches any letter
If the range is invalid (e.g., [b-a] where the starting code point is greater than the ending code point), an error_range exception is thrown.
When case-insensitive matching is enabled (with the icase flag), character classes expand to include case-folded variations:
pattern = regex.new('[E-F]', regex.ICASE)
-- Matches 'E', 'F', 'e', 'f', and any Unicode case variants
Note: Range [E-f] with icase flag will match all characters from U+0045 ('E') to U+0066 ('f'), including brackets, backslash, and other punctuation, plus their case-folded variants.
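To make the `[E-f]` pitfall concrete, the following sketch lists which printable ASCII characters such a range accepts. It runs on TypeScript's built-in `RegExp` as a stand-in for Kōtuku's engine, and `asciiMembers` is a hypothetical helper; results outside ASCII may differ in detail under Kōtuku's Unicode case folding.

```typescript
// Lists the printable ASCII characters (U+0020 to U+007E) accepted by a class.
function asciiMembers(classSource: string, flags: string): string {
  const re = new RegExp("^" + classSource + "$", flags);
  let members = "";
  for (let code = 0x20; code <= 0x7e; code++) {
    const ch = String.fromCharCode(code);
    if (re.test(ch)) members += ch;
  }
  return members;
}

// Case-sensitive: E to Z, six punctuation characters, then a to f.
const plainRange = asciiMembers("[E-f]", "");

// Case-insensitive: folding adds A to D and g to z, so every ASCII letter matches.
const foldedRange = asciiMembers("[E-f]", "i");
```

If the intent is "the letters E and F in either case", `[EFef]` or `[E-F]` with the icase flag avoids the punctuation between `Z` and `a`.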
Predefined character classes provide convenient shortcuts for common character sets:
| Pattern | Equivalent | Description |
|---|---|---|
| `\d` | `[0-9]` | Matches any decimal digit |
| `\D` | `[^0-9]` | Matches any non-digit |
| `\s` | `[ \t\n\v\f\r\u00a0\u1680\u2000-\u200a\u2028-\u2029\u202f\u205f\u3000\ufeff]` | Matches any whitespace character (WhiteSpace + LineTerminator) |
| `\S` | `[^ \t\n\v\f\r\u00a0\u1680\u2000-\u200a\u2028-\u2029\u202f\u205f\u3000\ufeff]` | Matches any non-whitespace |
| `\w` | `[0-9A-Za-z_]` | Matches any word character (alphanumeric + underscore) |
| `\W` | `[^0-9A-Za-z_]` | Matches any non-word character |
| `\p{...}` | (See Unicode Support) | Matches characters with specified Unicode property |
| `\P{...}` | (See Unicode Support) | Matches characters without specified Unicode property |
All predefined character classes can be used inside character classes:
pattern = regex.new('[\\d!\"#$%&\'()]') -- Matches digits or punctuation
Note: The \s whitespace class automatically expands when new code points are added to Unicode category Zs.
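The equivalences in the table above can be spot-checked with a few boundary cases. This is a sketch on TypeScript's built-in `RegExp`, which defines `\d`, `\s` and `\w` the same way; `classAccepts` is a hypothetical helper.

```typescript
// True when the single character `ch` is accepted by the given escape.
function classAccepts(escape: string, ch: string): boolean {
  return new RegExp("^" + escape + "$").test(ch);
}

const asciiDigit = classAccepts("\\d", "7");            // \d is [0-9]
const arabicIndicDigit = classAccepts("\\d", "\u0663"); // not in [0-9]
const noBreakSpace = classAccepts("\\s", "\u00a0");     // listed in the \s set
const byteOrderMark = classAccepts("\\s", "\ufeff");    // listed in the \s set
const underscoreIsWord = classAccepts("\\w", "_");      // part of [0-9A-Za-z_]
const accentedLetter = classAccepts("\\w", "\u00e9");   // é is outside [0-9A-Za-z_]
```

For digits or letters in any script, the Unicode properties `\p{Nd}` and `\p{L}` are the appropriate tools rather than `\d` and `\w`.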
Character classes support advanced set operations for precise character matching. These operations are always available as standard features.
The intersection operator && matches characters that belong to both sets:
[A&&B]
Examples:
-- Match lowercase Latin letters only
pattern = regex.new('[\\p{sc=Latin}&&\\p{Ll}]')
-- Matches: a, b, c, ..., z, ñ, ø, etc. (lowercase Latin)
-- Does NOT match: A, B, C, ... (not lowercase)
-- Match ASCII letters only (not extended Latin)
pattern = regex.new('[\\p{sc=Latin}&&[A-Za-z]]')
The subtraction operator -- matches characters in the first set but not in the second:
[A--B]
Examples:
-- Match Latin letters that are NOT lowercase
pattern = regex.new('[\\p{sc=Latin}--\\p{Ll}]')
-- Matches: A, B, C, ..., Z (uppercase, titlecase, etc.)
-- Does NOT match: a, b, c, ... (lowercase excluded)
-- Match letters except vowels
pattern = regex.new('[A-Za-z--[AEIOUaeiou]]')
-- Matches: consonants only
The \q{...} syntax allows character classes to match multi-character sequences:
[a-z\q{ch|th|ph}]
This matches either:
- Any single character from `a-z`, OR
- The sequence "ch", OR
- The sequence "th", OR
- The sequence "ph"
Longest Match Priority: When strings are included in a character class, the longest matching string is always selected first:
pattern = regex.new('[a-z\\q{ch|chocolate}]')
-- When matching "chocolate", matches the full word "chocolate"
-- Not "ch" followed by "ocolate"
The sequence [a-z\q{ch|th|ph}] is functionally equivalent to (?:ch|th|ph|[a-z]).
Examples:
-- Match common digraphs or single letters
pattern = regex.new('[a-z\\q{ch|sh|th|ph}]+')
-- Match emoticon sequences or letters
pattern = regex.new('[A-Z\\q{:-)|:-(|:-D}]')
String sequences can be used with all set operators (union, intersection, subtraction).
Character classes can be nested as operands for set operations:
-- Valid: nested classes with operators
pattern = regex.new('[\\p{sc=Latin}--[a-z]]')
-- Valid: nested union and subtraction
pattern = regex.new('[A[B--C]D]')
Operator Restriction: Only one type of operator can be used per level of nesting:
-- INVALID: mixing union and subtraction at the same level
[AB--CD] -- Error: union (AB) then subtraction (--)
-- VALID: operators in different nesting levels
[[AB]--[CD]] -- OK: separate nesting levels
[A[B--C]D] -- OK: subtraction inside union
Multiple uses of the same operator are permitted:
-- Valid: multiple subtractions at same level
[\\p{sc=Latin}--\\p{Lu}--[a-z]]
The following characters must be escaped with \ when used literally in character classes:
- `(`, `)`, `[`, `{`, `}`, `/`, `-`, `|`
- `]` must always be escaped (even outside character classes)
-- Correct
pattern = regex.new('[\\(\\)\\[\\]\\{\\}]')
-- Incorrect (throws error_noescape)
pattern = regex.new('[(]')
The following 18 double-character sequences are reserved for future use and cannot appear in character classes:
!! ## $$ %% ** ++ ,, ..
:: ;; << == >> ?? @@ ^^
`` ~~
If any of these appear in a character class, an error_operator exception is thrown.
Quantifiers specify how many times a pattern element must match. Each quantifier has a greedy and non-greedy form.
| Quantifier | Non-Greedy | Matches | Description |
|---|---|---|---|
| `*` | `*?` | 0 or more | Repeats the preceding element zero or more times |
| `+` | `+?` | 1 or more | Repeats the preceding element one or more times |
| `?` | `??` | 0 or 1 | Makes the preceding element optional |
| `{n}` | N/A | Exactly n | Repeats the preceding element exactly n times |
| `{n,}` | `{n,}?` | n or more | Repeats the preceding element at least n times |
| `{n,m}` | `{n,m}?` | n to m | Repeats the preceding element between n and m times (inclusive) |
Greedy quantifiers (default) match as many characters as possible while still allowing the overall pattern to succeed:
pattern = regex.new('a.*b')
match = pattern.match('axxxbxxxb')
-- match[1] = "axxxbxxxb" (matches up to the last 'b')
Non-greedy quantifiers (with ? suffix) match as few characters as possible while still allowing the overall pattern to succeed:
pattern = regex.new('a.*?b')
match = pattern.match('axxxbxxxb')
-- match[1] = "axxxb" (stops at the first 'b')
- Quantifiers must have a preceding expression to quantify. Using a quantifier without a preceding element (e.g., `*` at the start of a pattern) throws `error_badrepeat`.
- If a quantifier range is invalid (e.g., `{3,2}` where n > m), an `error_badbrace` exception is thrown.
- Mismatched `{` or `}` characters throw `error_brace`.
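The greedy and non-greedy behaviour, and the compile-time rejection of malformed quantifiers, can be sketched as follows. TypeScript's built-in `RegExp` is used as a stand-in: it reports a generic `SyntaxError` where this manual names `error_badrepeat` and `error_badbrace`, and the helper names are hypothetical.

```typescript
// Returns the text of the first match, or null when nothing matches.
function firstMatchText(source: string, text: string): string | null {
  const m = new RegExp(source).exec(text);
  return m === null ? null : m[0];
}

// True when the pattern source is rejected at compile time.
function failsToCompile(source: string): boolean {
  try {
    new RegExp(source);
    return false;
  } catch (err) {
    return err instanceof SyntaxError;
  }
}

const greedy = firstMatchText("a.*b", "axxxbxxxb"); // runs to the last 'b'
const lazy = firstMatchText("a.*?b", "axxxbxxxb");  // stops at the first 'b'

const nothingToRepeat = failsToCompile("*abc");   // quantifier with no operand
const reversedBounds = failsToCompile("a{3,2}");  // n > m
```

Catching compile errors once, at the point where the pattern is built, keeps malformed user-supplied patterns away from the matching code.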
-- Match one or more digits
pattern = regex.new('\\d+')
-- Match optional sign followed by digits
pattern = regex.new('[+-]?\\d+')
-- Match exactly 3 letters
pattern = regex.new('[A-Za-z]{3}')
-- Match 2 to 4 word characters (greedy)
pattern = regex.new('\\w{2,4}')
-- Match 2 to 4 word characters (non-greedy)
pattern = regex.new('\\w{2,4}?')
-- Match at least 5 digits
pattern = regex.new('\\d{5,}')
When a capturing group is quantified, the captured value is updated on each iteration. Only the last iteration's match is preserved:
pattern = regex.new('(?:(a)|(b))+')
match = pattern.match('ab')
-- match[1] = "ab" (full match)
-- match[2] = "" (empty, last iteration captured 'b', not 'a')
-- match[3] = "b" (last iteration captured 'b')
Parentheses create groups for capturing matches and controlling operator precedence.
Capturing groups are created with (...) and are numbered starting from 1:
pattern = regex.new('(\\d{3})-(\\d{3})-(\\d{4})')
match = pattern.match('555-123-4567')
-- match[1] = "555-123-4567" (full match, always at index 1)
-- match[2] = "555" (first capturing group)
-- match[3] = "123" (second capturing group)
-- match[4] = "4567" (third capturing group)
Group Numbering: Groups are numbered by the position of their opening ( parenthesis from left to right:
pattern = regex.new('((a)(b))c')
-- Group 1: ((a)(b))
-- Group 2: (a)
-- Group 3: (b)
Non-capturing groups (?:...) group expressions without creating a capture:
pattern = regex.new('(?:tak(?:e|ing))')
-- Matches "take" or "taking" without capturing
Use non-capturing groups to:
- Apply quantifiers to multiple characters: `(?:ab)+`
- Group alternatives: `(?:cat|dog)`
- Improve performance (slightly faster than capturing groups)
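The difference between capturing and non-capturing groups under a quantifier can be seen in a small sketch. It uses TypeScript's built-in `RegExp` as a stand-in; `groupsOf` is a hypothetical helper. Two conventions differ from the Tiri examples: JavaScript puts the full match at index 0, and it reports a group that did not participate as `undefined` where this manual shows an empty string.

```typescript
// Returns the full match followed by each capture group (undefined when unset).
function groupsOf(source: string, text: string): Array<string | undefined> {
  const m = new RegExp(source).exec(text);
  const out: Array<string | undefined> = [];
  if (m !== null) {
    for (let i = 0; i < m.length; i++) out.push(m[i]);
  }
  return out;
}

// Non-capturing: only the full match is reported.
const nonCapturing = groupsOf("(?:ab)+", "ababab");

// Capturing: group 1 holds the text of the last iteration only.
const capturing = groupsOf("(ab)+", "ababab");

// Captures are reset on every iteration, so (a) is unset after the final pass.
const lastIteration = groupsOf("(?:(a)|(b))+", "ab");
```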
Named groups associate a name with a captured substring:
(?<name>...)
Example:
pattern = regex.new('(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})')
match = pattern.match('2025-10-14')
-- match[1] = "2025-10-14"
-- match[2] = "2025" (group 1, also accessible as 'year')
-- match[3] = "10" (group 2, also accessible as 'month')
-- match[4] = "14" (group 3, also accessible as 'day')
Named groups are also assigned a number and can be accessed by both name and number.
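A minimal sketch of accessing the same named groups by number and by name, written against TypeScript's built-in `RegExp` rather than Kōtuku's Regex class. The `$<name>` replacement syntax is JavaScript's; how Kōtuku exposes names in its API is covered in the Regex module documentation, and the helper names here are hypothetical.

```typescript
const isoDate = new RegExp("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})");

// Access by number: index 0 is the full match, groups start at 1 in JavaScript.
function yearByNumber(text: string): string | null {
  const m = isoDate.exec(text);
  if (m === null) return null;
  const year: string | undefined = m[1];
  return year === undefined ? null : year;
}

// Access by name: replacement strings may refer to groups as $<name>.
function toDayMonthYear(text: string): string {
  return text.replace(isoDate, "$<day>/$<month>/$<year>");
}
```

Names make the replacement string independent of group order, so the pattern can be rearranged without renumbering references.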
Named groups can be reused if they appear in different alternatives:
pattern = regex.new('(?<year>\\d{4})-\\d{1,2}|\\d{1,2}-(?<year>\\d{4})')
-- Matches "2025-10" or "10-2025"
-- 'year' captures the 4-digit year from either position
This feature was introduced in ES2025.
A backreference \N (where N is a positive integer starting from 1) matches the same text that was captured by group N:
pattern = regex.new('(TO|to)..\\1')
-- Matches "TOMATO" or "tomato" but not "Tomato"
-- \1 refers to captured text from group 1
Example:
pattern = regex.new('(["\']).*?\\1')
-- Matches string in quotes: "hello" or 'hello'
-- But not mixed quotes: "hello'
A backreference \k<name> matches the text captured by a named group:
pattern = regex.new('(?<quote>["\']).*?\\k<quote>')
-- Same as above, but using named group
- Forward References: Backreferences can appear before their corresponding group:
  pattern = regex.new('\\1(abc)') -- Valid in ECMAScript
- Undefined Matches: A backreference to a group that hasn't captured anything matches the empty string:
  pattern = regex.new('(a)?b\\1') -- Matches "b" (group 1 didn't capture, so \1 matches empty string)
- Invalid Groups: If a backreference refers to a non-existent group number, an `error_backref` exception is thrown:
  pattern = regex.new('\\5') -- Error: no group 5 exists
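The three backreference rules above can be exercised with a short sketch on TypeScript's built-in `RegExp`, used here as a stand-in. One difference is an assumption worth flagging: JavaScript only rejects a reference to a missing group when the `u` flag is set, whereas this manual describes Kōtuku as always throwing `error_backref`. Helper names are hypothetical.

```typescript
// Returns the text of the first match, or null when nothing matches.
function firstMatchOrNull(source: string, text: string): string | null {
  const m = new RegExp(source).exec(text);
  return m === null ? null : m[0];
}

// True when the pattern is rejected in JavaScript's Unicode mode.
function rejectedInUnicodeMode(source: string): boolean {
  try {
    new RegExp(source, "u");
    return false;
  } catch (err) {
    return err instanceof SyntaxError;
  }
}

const sameQuotes = firstMatchOrNull("([\"']).*?\\1", "say \"hi\" now"); // quoted part
const mixedQuotes = firstMatchOrNull("([\"']).*?\\1", "\"hello'");      // no match

// Forward reference: \1 is still unset when reached, so it matches empty.
const forwardRef = firstMatchOrNull("\\1(abc)", "abc");

// Unset group: on "ab" the match is just "b"; on "aba" group 1 is set.
const unsetGroup = firstMatchOrNull("(a)?b\\1", "ab");
const setGroup = firstMatchOrNull("(a)?b\\1", "aba");

const missingGroup = rejectedInUnicodeMode("\\5");
```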
When a capturing group is inside a quantified expression, captures are cleared on each iteration:
pattern = regex.new('(?:(a)|(b))+')
match = pattern.match('ab')
-- Only the last iteration's captures are retained
-- match[2] = "" (group 1's last iteration matched nothing)
-- match[3] = "b" (group 2's last iteration matched "b")
Flag modifiers allow inline control of matching behaviour within specific parts of a pattern.
Bounded flag modifiers enable or disable flags only within a specific group:
(?ims-ims:...)
Available Flags:
| Flag | Meaning |
|---|---|
| `i` | Case-insensitive matching (icase) |
| `m` | Multiline mode (^ and $ match line boundaries) |
| `s` | Dotall mode (. matches line terminators) |
| `-i` | Disable case-insensitive matching |
| `-m` | Disable multiline mode |
| `-s` | Disable dotall mode |
Examples:
-- Case-insensitive only for middle section
pattern = regex.new('hello(?i:world)THERE')
-- Matches: "helloworldTHERE", "helloWORLDTHERE", "helloWoRlDTHERE"
-- Does NOT match: "HELLOworldthere" (case-sensitive outside group)
-- Combine multiple flags
pattern = regex.new('(?ims:.*)')
-- Case-insensitive + multiline + dotall for entire group
-- Disable flags
pattern = regex.new('(?i)hello(?-i:world)')
-- "hello" is case-insensitive, "world" is case-sensitive
- Single Use per Flag: Each flag letter can only appear once per modifier group:
  -- INVALID: 'i' appears twice
  (?ii:...) -- Throws error_modifier
  (?i-i:...) -- Throws error_modifier
- Scope: Flag modifiers affect only the expressions inside their group.
- ES2025 Feature: Bounded flag modifiers were introduced in ES2025 and are enabled by default.
Assertions test conditions at the current position without consuming characters (zero-width).
| Assertion | Description |
|---|---|
| `^` | Matches at the start of the string. With multiline flag, also matches immediately after line terminators. |
| `$` | Matches at the end of the string. With multiline flag, also matches immediately before line terminators. |
Examples:
-- Match lines starting with "#"
pattern = regex.new('^#.*', regex.MULTILINE)
-- Match lines ending with ";"
pattern = regex.new('.*;$', regex.MULTILINE)
| Assertion | Description |
|---|---|
| `\b` | Matches at a word boundary (between \w and \W) |
| `\B` | Matches at a non-word boundary (not between \w and \W) |
Examples:
-- Match "cat" as a whole word
pattern = regex.new('\\bcat\\b')
-- Matches: "cat in hat"
-- Does NOT match: "concatenate"
-- Match "cat" not as a whole word
pattern = regex.new('\\Bcat\\B')
-- Matches: "concatenate"
-- Does NOT match: "cat in hat"
Note: Inside a character class [...], \b matches the backspace character (U+0008), not a word boundary. Using \B inside a character class throws error_escape.
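The two meanings of `\b` can be confirmed with a brief sketch on TypeScript's built-in `RegExp`, which follows the same ECMAScript rules; `hasMatch` is a hypothetical helper.

```typescript
// True when the pattern matches anywhere in `text`.
function hasMatch(source: string, text: string): boolean {
  return new RegExp(source).test(text);
}

const wholeWord = hasMatch("\\bcat\\b", "cat in hat");        // "cat" stands alone
const insideWord = hasMatch("\\bcat\\b", "concatenate");      // letters on both sides
const notAtBoundary = hasMatch("\\Bcat\\B", "concatenate");   // surrounded by word characters

// Inside a class, \b is the backspace character U+0008, not an assertion.
const backspaceInClass = hasMatch("[\\b]", "\u0008");
const noBoundaryInClass = hasMatch("[\\b]", "cat");
```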
Lookahead assertions check if a pattern matches ahead without consuming characters:
| Assertion | Description |
|---|---|
| `(?=...)` | Positive lookahead: succeeds if pattern matches ahead |
| `(?!...)` | Negative lookahead: succeeds if pattern does NOT match ahead |
Examples:
-- Match "a" only if followed by "bc" or "def"
pattern = regex.new('a(?=bc|def)')
-- Matches: "abc" (captures "a"), "adef" (captures "a")
-- Does NOT match: "axyz"
-- Match "a" only if NOT followed by "bc" or "def"
pattern = regex.new('a(?!bc|def)')
-- Matches: "axyz" (captures "a")
-- Does NOT match: "abc", "adef"
-- Find & symbols that are not HTML entities
pattern = regex.new('&(?!amp;|lt;|gt;|#)')
-- Matches bare "&" but not "&amp;", "&lt;", etc.
Lookbehind assertions check if a pattern matches behind without consuming characters:
| Assertion | Description |
|---|---|
| `(?<=...)` | Positive lookbehind: succeeds if pattern matches behind |
| `(?<!...)` | Negative lookbehind: succeeds if pattern does NOT match behind |
Examples:
-- Match "a" only if preceded by "bc" or "de"
pattern = regex.new('(?<=bc|de)a')
-- Matches: "bca" (captures "a"), "dea" (captures "a")
-- Does NOT match: "xa"
-- Match "a" only if NOT preceded by "bc" or "de"
pattern = regex.new('(?<!bc|de)a')
-- Matches: "xa" (captures "a")
-- Does NOT match: "bca", "dea"
Assertions can be combined for complex matching:
-- Match words between 3-6 letters containing at least one vowel
pattern = regex.new('\\b(?=\\w*[aeiou])\\w{3,6}\\b', regex.ICASE)
-- Match integer strings that are not part of larger numbers
pattern = regex.new('(?<!\\d)\\d+(?!\\d)')
Kōtuku's regex implementation provides full Unicode support with UTF-8 encoding enabled by default.
Unicode properties match characters based on their Unicode characteristics using \p{...} and \P{...}:
| Pattern | Description |
|---|---|
| `\p{Property}` | Matches characters with the specified Unicode property |
| `\P{Property}` | Matches characters without the specified Unicode property |
Match characters from specific writing systems:
-- Match Latin characters
pattern = regex.new('\\p{sc=Latin}+')
-- Match Greek characters
pattern = regex.new('\\p{Script=Greek}+')
-- Match characters whose Script_Extensions property includes Latin
pattern = regex.new('\\p{scx=Latin}+')
Common scripts: Latin, Greek, Cyrillic, Han, Arabic, Hebrew, Hiragana, Katakana, etc.
Match characters by their general category:
| Property | Description | Examples |
|---|---|---|
| `\p{Lu}` | Uppercase letter | A, B, Z, À, Ω |
| `\p{Ll}` | Lowercase letter | a, b, z, à, ω |
| `\p{Lt}` | Titlecase letter | Dž, Lj, Nj |
| `\p{L}` | Any letter (Lu, Ll, Lt, Lm, Lo) | All letters |
| `\p{Nd}` | Decimal number | 0-9, ০-৯ |
| `\p{N}` | Any number (Nd, Nl, No) | All numbers |
| `\p{P}` | Punctuation | ., !, ?, ; |
| `\p{S}` | Symbol | $, +, =, © |
| `\p{Z}` | Separator | Space, non-breaking space |
| `\p{C}` | Other (control, format, etc.) | Control characters |
Examples:
-- Match any letter in any script
pattern = regex.new('\\p{L}+')
-- Match digits in any script
pattern = regex.new('\\p{Nd}+')
-- Match all punctuation
pattern = regex.new('\\p{P}+')
Binary properties have true/false values:
-- Match whitespace characters
pattern = regex.new('\\p{White_Space}+')
-- Match emoji
pattern = regex.new('\\p{Emoji}')
-- Match characters used in identifiers
pattern = regex.new('\\p{ID_Start}\\p{ID_Continue}*')
Properties can be specified in several formats:
-- Short form
\\p{Lu} -- Uppercase letter
\\p{sc=Latin} -- Latin script
-- Long form
\\p{Script=Latin}
\\p{General_Category=Uppercase_Letter}
-- Binary properties
\\p{Emoji}
\\p{White_Space}
For a complete list of available properties, see the ECMAScript Unicode Property Table.
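As an illustration of the property forms above, the following sketch collects all matches of a property pattern from a mixed-script string. It runs on TypeScript's built-in `RegExp`, which needs the `u` flag for `\p{...}` (this manual says Kōtuku enables Unicode by default); `allMatches` and `sample` are hypothetical names.

```typescript
// Collects every match of a Unicode-aware pattern in `text`.
function allMatches(source: string, text: string): string[] {
  const re = new RegExp(source, "gu");
  const found: string[] = [];
  let m = re.exec(text);
  while (m !== null) {
    found.push(m[0]);
    m = re.exec(text);
  }
  return found;
}

// "Ab", Greek capital and small omega, ASCII digits, Arabic-Indic digit three.
const sample = "Ab \u03a9\u03c9 42 \u0663";

const upperRuns = allMatches("\\p{Lu}+", sample);           // short general-category form
const greekRuns = allMatches("\\p{Script=Greek}+", sample); // long script form
const digitRuns = allMatches("\\p{Nd}+", sample);           // digits in any script
```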
Some Unicode properties match sequences of multiple characters (string properties). These can be used in character classes except negated classes:
-- Valid: string property in positive class
pattern = regex.new('[\\p{RGI_Emoji}]')
-- INVALID: string property with negation
pattern = regex.new('[^\\p{RGI_Emoji}]') -- Throws error_complement
-- INVALID: string property with \P{...}
pattern = regex.new('\\P{RGI_Emoji}') -- Throws error_complement
When case-insensitive matching is enabled with the icase flag, Unicode case folding rules apply:
pattern = regex.new('café', regex.ICASE)
-- Matches: "café", "CAFÉ", "Café", "cAfÉ", etc.
Case folding follows Unicode simple case folding rules, which may match more characters than ASCII uppercasing/lowercasing:
pattern = regex.new('ß', regex.ICASE)
-- Matches: "ß" and "ẞ" (U+1E9E, capital sharp S)
-- Does not match "SS": ECMAScript case folding maps one code point to one code point
Character classes operate on Unicode code points:
-- Match all characters in Basic Multilingual Plane
pattern = regex.new('[\\u0000-\\uFFFF]+')
-- Match emoji range (partial)
pattern = regex.new('[\\u{1F600}-\\u{1F64F}]+')
The regex engine validates UTF-8 sequences:
- Trailing bytes must be in range 0x80-0xBF. Invalid trailing bytes cause matching to fail at that position.
- Code points must be ≤ 0x10FFFF. Values exceeding this cause matching to fail.
- Non-shortest forms are rejected. For example, U+0030 (digit '0') must be encoded as 0x30, not as the longer forms 0xC0 0xB0 or 0xE0 0x80 0xB0.
At pattern compile time, invalid UTF-8 throws error_utf8. At matching time, invalid UTF-8 leads to match failure at that position.
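The three validation rules above can be sketched as a standalone checker. This is an illustrative TypeScript implementation of the stated rules only, not Kōtuku's actual code; `isWellFormedUtf8` is a hypothetical name, and checks the manual does not mention (such as surrogate code points) are deliberately left out.

```typescript
// Checks a byte sequence against the three rules listed above.
function isWellFormedUtf8(bytes: number[]): boolean {
  let i = 0;
  while (i < bytes.length) {
    const lead = bytes[i];
    let extra: number;     // number of trailing bytes expected
    let codePoint: number; // payload bits taken from the lead byte
    let minimum: number;   // smallest code point that needs this length
    if (lead < 0x80) {
      extra = 0; codePoint = lead; minimum = 0;
    } else if (lead >= 0xc0 && lead < 0xe0) {
      extra = 1; codePoint = lead & 0x1f; minimum = 0x80;
    } else if (lead >= 0xe0 && lead < 0xf0) {
      extra = 2; codePoint = lead & 0x0f; minimum = 0x800;
    } else if (lead >= 0xf0 && lead < 0xf8) {
      extra = 3; codePoint = lead & 0x07; minimum = 0x10000;
    } else {
      return false; // stray trailing byte or invalid lead byte
    }
    if (i + extra >= bytes.length) return false; // sequence is truncated
    for (let k = 1; k <= extra; k++) {
      const trail = bytes[i + k];
      if (trail < 0x80 || trail > 0xbf) return false; // rule 1: trailing byte range
      codePoint = (codePoint << 6) | (trail & 0x3f);
    }
    if (codePoint > 0x10ffff) return false; // rule 2: code point limit
    if (codePoint < minimum) return false;  // rule 3: non-shortest form
    i += extra + 1;
  }
  return true;
}

const plainZero = isWellFormedUtf8([0x30]);               // shortest form of U+0030
const overlongZero = isWellFormedUtf8([0xc0, 0xb0]);      // two-byte form of U+0030
const tooLarge = isWellFormedUtf8([0xf4, 0x90, 0x80, 0x80]); // would be U+110000
```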
Compilation flags affect how a regex pattern is compiled and interpreted. These flags are specified when creating a regex object.
| Flag | Effect |
|---|---|
| `ICASE` | Case-insensitive matching. Matches characters regardless of case using Unicode case-folding rules. |
| `MULTILINE` | Multiline mode. The ^ and $ anchors match at line boundaries (after/before line terminators) in addition to string boundaries. |
| `DOT_ALL` | Dotall (singleline) mode. The . metacharacter matches line terminators (U+000A, U+000D, U+2028, U+2029) in addition to all other characters. |
The exact syntax for specifying flags depends on the language binding:
Tiri:
pattern = regex.new('hello', regex.ICASE)
pattern = regex.new('.*', regex.DOT_ALL)
pattern = regex.new('^line', regex.MULTILINE | regex.ICASE)
C++:
auto icase_pattern = kt::regex("hello", kt::regex::ICASE);
auto dotall_pattern = kt::regex(".*", kt::regex::DOT_ALL);
auto multiline_pattern = kt::regex("^line", kt::regex::MULTILINE | kt::regex::ICASE);
The ICASE flag makes pattern matching case-insensitive using Unicode case-folding:
pattern = regex.new('hello', regex.ICASE)
-- Matches: "hello", "HELLO", "Hello", "HeLLo", etc.
pattern = regex.new('[a-z]+', regex.ICASE)
-- Matches: "abc", "ABC", "aBc", etc.
The MULTILINE flag changes the behaviour of the ^ and $ anchors so that they match at line boundaries:
pattern = regex.new('^\\w+', regex.MULTILINE)
-- Without MULTILINE: matches word at start of string only
-- With MULTILINE: matches word at start of string AND after each line terminator
text = "first line\nsecond line\nthird line"
pattern = regex.new('^\\w+', regex.MULTILINE)
-- Matches: "first", "second", "third"
The DOT_ALL flag makes . match line terminators in addition to all other characters:
pattern = regex.new('.*', regex.DOT_ALL)
-- Without DOT_ALL: .* matches up to (but not including) line terminators
-- With DOT_ALL: .* matches everything including line terminators
text = "line 1\nline 2\nline 3"
pattern = regex.new('.*', regex.DOT_ALL)
match = pattern.match(text)
-- match[1] = "line 1\nline 2\nline 3" (entire string)
Note: When DOT_ALL is set, .* will match all remaining characters in the subject string.
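The effect of the three compilation flags can be compared side by side in a short sketch. It uses TypeScript's built-in `RegExp`, where the flags are spelled `i`, `m` and `s`; treating them as counterparts of `ICASE`, `MULTILINE` and `DOT_ALL` is an assumption based on the descriptions above, and `allMatchTexts` is a hypothetical helper.

```typescript
const threeLines = "first line\nsecond line\nthird line";

// Collects every match of the pattern under the given flags.
function allMatchTexts(source: string, flags: string, text: string): string[] {
  const re = new RegExp(source, flags + "g");
  const found: string[] = [];
  let m = re.exec(text);
  while (m !== null) {
    found.push(m[0]);
    if (m[0] === "") re.lastIndex++; // step past empty matches
    m = re.exec(text);
  }
  return found;
}

const startOfString = allMatchTexts("^\\w+", "", threeLines);    // ^ matches at offset 0 only
const startOfEachLine = allMatchTexts("^\\w+", "m", threeLines); // ^ also matches after "\n"
const dotPerLine = allMatchTexts(".+", "", threeLines);          // . stops at "\n"
const dotWholeText = allMatchTexts(".+", "s", threeLines);       // . crosses "\n"
const anyCase = allMatchTexts("hello", "i", "Hello HELLO hello");
```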
Match flags modify the behaviour of matching operations at runtime, after a pattern has been compiled. These flags are passed to matching functions (test, match, search, replace, split).
| Flag | Effect |
|---|---|
| `NOT_BEGIN_OF_LINE` | Do not treat the beginning of the text as the start of a line (affects ^ in multiline mode) |
| `NOT_END_OF_LINE` | Do not treat the end of the text as the end of a line (affects $ in multiline mode) |
| `NOT_BEGIN_OF_WORD` | Do not treat the beginning of the text as the start of a word (affects \b) |
| `NOT_END_OF_WORD` | Do not treat the end of the text as the end of a word (affects \b) |
| `NOT_NULL` | Do not match empty sequences |
| `CONTINUOUS` | Only match at the beginning of the text (anchored search) |
| `PREV_AVAILABLE` | Indicates that the previous character position is available for lookbehind assertions |
| `REPLACE_NO_COPY` | In replace operations, do not copy non-matching parts of the text |
| `REPLACE_FIRST_ONLY` | In replace operations, replace only the first occurrence |
Tiri:
pattern = regex.new('\\w+')
-- Replace only first occurrence
result = pattern.replace('hello world', 'goodbye', regex.REPLACE_FIRST_ONLY)
-- result = "goodbye world"
-- Match only at beginning
match = pattern.match('hello world', regex.CONTINUOUS)
-- Succeeds (starts at beginning)
match = pattern.match(' hello', regex.CONTINUOUS)
-- Fails (does not start at beginning)
NOT_BEGIN_OF_LINE is useful when matching in the middle of a larger text:
pattern = regex.new('^hello', regex.MULTILINE)
-- Normal matching
pattern.test('hello') -- true (at beginning)
-- With NOT_BEGIN_OF_LINE
pattern.test('hello', regex.NOT_BEGIN_OF_LINE) -- false (not treated as line start)
NOT_NULL prevents matching empty strings:
pattern = regex.new('a*')
-- Normal: matches empty string
pattern.test('') -- true
-- With NOT_NULL: rejects empty match
pattern.test('', regex.NOT_NULL) -- false
CONTINUOUS forces the match to start at the beginning of the text:
pattern = regex.new('\\d+')
-- Normal: finds "123" anywhere
pattern.match(' 123') -- Matches "123"
-- With CONTINUOUS: must start at position 0
pattern.match(' 123', regex.CONTINUOUS) -- Fails
pattern.match('123', regex.CONTINUOUS) -- Succeeds
REPLACE_NO_COPY affects replace operations by excluding non-matching text:
pattern = regex.new('\\d+')
-- Normal replace: keeps non-matching text
pattern.replace('a123b456c', 'X') -- "aXbXc"
-- With REPLACE_NO_COPY: only includes replacements
pattern.replace('a123b456c', 'X', regex.REPLACE_NO_COPY) -- "XX"
REPLACE_FIRST_ONLY limits replacement to the first match:
pattern = regex.new('\\d+')
-- Normal replace: replaces all
pattern.replace('123 456 789', 'X') -- "X X X"
-- With REPLACE_FIRST_ONLY: replaces only first
pattern.replace('123 456 789', 'X', regex.REPLACE_FIRST_ONLY) -- "X 456 789"
Kōtuku's regex implementation is based on the ECMAScript specification and provides the following characteristics:
The implementation supports expressions defined in the ECMAScript Specification (latest draft), including:
- ECMAScript 2018 (ES9): Named capture groups, lookbehind assertions, Unicode property escapes
- ECMAScript 2025: Duplicate named capture groups, bounded flag modifiers
- Set operations for character classes (intersection, subtraction, string sequences)
Features of other regex flavours that are unsupported or behave differently:
- No `\Q...\E` literal sequences: Use explicit escaping instead
- No possessive quantifiers: Emulate with a lookahead and a backreference (atomic groups are not available either)
- No recursive patterns: Not supported in ECMAScript
- No conditional patterns: Use alternation with lookahead instead
- Different Unicode categories: Follow ECMAScript Unicode property names

- No balanced groups: Named captures cannot be reused except in alternatives
- No inline comments: `(?#...)` is not supported
- Different flag syntax: Uses ECMAScript `(?ims:...)` instead of `(?imnsx-imnsx:...)`

- No POSIX character classes: Use Unicode properties instead (e.g., `\p{Alpha}` instead of `[[:alpha:]]`)
- No collating sequences: `[.ch.]` not supported
- No equivalence classes: `[=e=]` not supported
Backreferences can appear before their corresponding groups:
pattern = regex.new('\\1(abc)') -- Valid
This is valid in ECMAScript but may fail or behave differently in other engines.
Backreferences to groups that haven't captured anything match the empty string:
pattern = regex.new('(a)?b\\1')
-- Matches "b" (group 1 captured nothing, so \1 matches empty string)
The ECMAScript specification does not define octal escape sequences like \ooo or \0ooo (except \0 for NULL):
-- Valid
pattern = regex.new('\\0') -- Matches NULL (U+0000)
-- Invalid (not defined by ECMAScript)
pattern = regex.new('\\101') -- Error: invalid escape
Use hexadecimal or Unicode escapes instead:
pattern = regex.new('\\x41') -- 'A' in hexadecimal
pattern = regex.new('\\u0041') -- 'A' in Unicode
Some operations not directly supported can be achieved through alternative patterns:
Intersection (Alternative Method):
-- Direct: [\p{sc=Latin}&&\p{Ll}]
-- Alternative: using lookahead
(?=\\p{sc=Latin})\\p{Ll}
Subtraction (Alternative Method):
-- Direct: [\p{sc=Latin}--\p{Ll}]
-- Alternative: using negative lookahead
(?!\\p{Ll})\\p{sc=Latin}
Atomic Groups:
-- Perl/PCRE: (?>pattern)
-- ECMAScript equivalent: (?=(pattern))\1
Regex patterns should be compiled once and reused:
Inefficient:
for i = 1, 10000 do
pattern = regex.new('\\d+') -- Compiles pattern 10,000 times
pattern.test(data[i])
end
Efficient:
pattern = regex.new('\\d+') -- Compiles pattern once
for i = 1, 10000 do
pattern.test(data[i]) -- Reuses compiled pattern
end
Store frequently used patterns in variables (local or global) using deferred expressions rather than recreating them:
-- Compiled patterns
emailPattern = <{ regex.new('[\\w._%+-]+@[\\w.-]+\\.[A-Za-z]{2,}') }>
phonePattern = <{ regex.new('\\d{3}-\\d{3}-\\d{4}') }>
datePattern = <{ regex.new('\\d{4}-\\d{2}-\\d{2}') }>
-- Use patterns multiple times efficiently
for contact in values(contacts) do
if emailPattern.test(contact.email) then
processEmail(contact)
end
if phonePattern.test(contact.phone) then
processPhone(contact)
end
end
Non-greedy quantifiers can improve performance in some cases:
-- Greedy: tries to match as much as possible, then backtracks
pattern = regex.new('<.*>')
-- Matches: "<tag>content</tag>" as one match (backtracks from end)
-- Non-greedy: stops at first opportunity
pattern = regex.new('<.*?>')
-- Matches: "<tag>" and "</tag>" separately (extends one character at a time until '>' matches)
When extracting short tags from long HTML/XML text, the non-greedy form typically does less work:
-- Extract tag content efficiently
pattern = regex.new('<([^>]+)>(.*?)</\\1>')
Certain patterns can cause exponential time complexity:
Dangerous Pattern:
-- Exponential backtracking on non-match
pattern = regex.new('(a+)+b')
text = 'aaaaaaaaaaaaaaaaaac' -- No 'b' at end
-- This takes exponential time as the input length increases
Solutions:
- Use possessive-like behaviour:
  -- Prevent backtracking with atomic group simulation
  pattern = regex.new('(?=(a+))\\1+b')
- Use negated character classes:
  -- Clearer intent, better performance
  pattern = regex.new('[^b]+b')
- Be specific about what you're matching:
  -- Instead of: .*
  -- Use: [^<]+ (if not matching '<')
  -- Use: \\w+ (if matching word characters)
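Because the syntax follows ECMAScript, the lookahead-plus-backreference rewrite can be checked in any ECMAScript engine. The sketch below uses TypeScript's built-in RegExp rather than the Kōtuku Regex class, and the variable names are illustrative only. The input is kept short so that the nested form still finishes quickly.

```typescript
// Illustrative names only; this uses the built-in RegExp, not the Kōtuku API.
const nested = /(a+)+b/;          // nested quantifier: exponential on a failed match
const emulated = /(?=(a+))\1+b/;  // lookahead capture + backreference (atomic-like)

// 18 'a' characters followed by 'c': neither pattern can match.
const miss = "a".repeat(18) + "c";
const hit = "aaab";

const results = {
  nestedMiss: nested.test(miss),
  emulatedMiss: emulated.test(miss),
  nestedHit: nested.test(hit),
  emulatedHit: emulated.test(hit),
};

console.log(results);
```

Both forms agree on which inputs match. The difference is how much backtracking a failing input can trigger: the lookahead commits to the longest run of 'a', so the engine cannot retry shorter splits of that run.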
Use predefined classes when possible:
-- Faster
pattern = regex.new('\\d+')
-- Slower (equivalent but not optimised)
pattern = regex.new('[0-9]+')
Simplify complex classes:
-- Complex
pattern = regex.new('[A-Za-z0-9_]+')
-- Simpler and equivalent
pattern = regex.new('\\w+')
Anchor patterns to reduce search space:
-- Unanchored: searches entire string
pattern = regex.new('\\d+')
-- Anchored: only checks from beginning
pattern = regex.new('^\\d+')
-- Anchored both ends: exact match only
pattern = regex.new('^\\d+$')
Unicode properties are optimised internally, but broad categories are faster than specific scripts:
-- Faster: general category
pattern = regex.new('\\p{L}+') -- All letters
-- Slower: specific script
pattern = regex.new('\\p{sc=Latin}+') -- Latin letters only
- Compile patterns once, reuse many times
- Store patterns in variables
- Use non-greedy quantifiers when appropriate
- Anchor patterns when possible (^, $)
- Avoid nested quantifiers that can cause exponential backtracking
- Use predefined character classes (\d, \w, \s)
- Be specific in patterns to reduce backtracking
- Test performance with realistic data
This section provides practical regex patterns for common use cases.
Basic email pattern:
pattern = regex.new('[\\w._%+-]+@[\\w.-]+\\.[A-Za-z]{2,}')
-- Matches: user@example.com, first.last@sub.domain.co.uk
Explanation:
- [\w._%+-]+ - Username: word characters, dots, underscores, percent, plus, hyphen
- @ - Literal @ symbol
- [\w.-]+ - Domain name: word characters, dots, hyphens
- \. - Literal dot
- [A-Za-z]{2,} - Top-level domain: 2 or more letters
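The breakdown above can be exercised in any ECMAScript engine. The following is a minimal sketch using TypeScript's built-in RegExp, not the Kōtuku Regex class. The helper name findEmail is hypothetical. Because the basic pattern is unanchored, it finds an address anywhere inside a longer string.

```typescript
// Basic email pattern from the text, written as an ECMAScript literal.
const emailPattern = /[\w._%+-]+@[\w.-]+\.[A-Za-z]{2,}/;

// Hypothetical helper: returns the first address found in the text, or null.
function findEmail(text: string): string | null {
  const m = emailPattern.exec(text);
  return m ? m[0] : null;
}

const found = findEmail("Contact: first.last@sub.domain.co.uk today");
const noTld = findEmail("user@localhost"); // no dot followed by letters after '@'

console.log(found, noTld);
```

The domain part first consumes the whole of sub.domain.co.uk and then backtracks to the last dot, so the top-level domain captured is "uk".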
More strict pattern:
pattern = regex.new('^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$')
-- Anchored to match entire string
Basic URL pattern:
pattern = regex.new('(https?)://([^/\\s]+)([^\\s]*)')
-- Captures: protocol, domain, path
match = pattern.match('https://example.com/path?query=value')
-- match[1] = "https://example.com/path?query=value" (full match)
-- match[2] = "https" (protocol)
-- match[3] = "example.com" (domain)
-- match[4] = "/path?query=value" (path)
With named captures:
pattern = regex.new('(?<protocol>https?)://(?<domain>[^/\\s]+)(?<path>[^\\s]*)')
match = pattern.match('https://example.com/path')
-- Access by name: match.domain (language binding dependent)
-- Access by number: match[3]
US phone number:
-- Format: 555-123-4567
pattern = regex.new('\\d{3}-\\d{3}-\\d{4}')
-- With optional country code: +1-555-123-4567
pattern = regex.new('(\\+1-)?\\d{3}-\\d{3}-\\d{4}')
-- With optional separators (-, ., space, or none)
pattern = regex.new('\\d{3}[-. ]?\\d{3}[-. ]?\\d{4}')
International E.164 format:
-- +1234567890 to +123456789012345
pattern = regex.new('\\+\\d{1,15}')
ISO 8601 date (YYYY-MM-DD):
pattern = regex.new('\\d{4}-\\d{2}-\\d{2}')
-- Matches: 2025-10-14
-- With validation (basic):
pattern = regex.new('\\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])')
-- Validates month (01-12) and day (01-31)
US date format (MM/DD/YYYY):
pattern = regex.new('(0[1-9]|1[0-2])/(0[1-9]|[12]\\d|3[01])/\\d{4}')
-- Matches: 10/14/2025
Flexible date format:
pattern = regex.new('\\d{1,2}[-/]\\d{1,2}[-/]\\d{2,4}')
-- Matches: 10/14/2025, 10-14-25, 1/5/2025
24-hour time (HH:MM):
pattern = regex.new('([01]?\\d|2[0-3]):[0-5]\\d')
-- Matches: 09:30, 23:59, 8:05
-- With optional seconds:
pattern = regex.new('([01]?\\d|2[0-3]):[0-5]\\d(:[0-5]\\d)?')
-- Matches: 09:30, 09:30:45
12-hour time with AM/PM:
pattern = regex.new('(0?[1-9]|1[0-2]):[0-5]\\d\\s*([AaPp][Mm])')
-- Matches: 9:30 AM, 12:45 PM, 9:30AM
Minimum requirements (8+ chars, 1 uppercase, 1 lowercase, 1 digit):
pattern = regex.new('^(?=.*[a-z])(?=.*[A-Z])(?=.*\\d).{8,}$')
Explanation:
- ^ - Start of string
- (?=.*[a-z]) - Lookahead: at least one lowercase
- (?=.*[A-Z]) - Lookahead: at least one uppercase
- (?=.*\d) - Lookahead: at least one digit
- .{8,} - At least 8 characters
- $ - End of string
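Each lookahead in the breakdown above is an independent condition checked from the start of the string. As a hedged illustration, the sketch below runs the same pattern through TypeScript's built-in RegExp (not the Kōtuku Regex class). The sample passwords and the helper name meetsMinimum are made up for the example.

```typescript
// Password pattern from the text, written as an ECMAScript literal.
const passwordPattern = /^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$/;

// Hypothetical helper wrapping the test call.
function meetsMinimum(candidate: string): boolean {
  return passwordPattern.test(candidate);
}

const outcomes = {
  valid: meetsMinimum("Passw0rdX"),     // lowercase, uppercase, digit, 9 characters
  noUpper: meetsMinimum("password1"),   // fails (?=.*[A-Z])
  noLower: meetsMinimum("ALLUPPER123"), // fails (?=.*[a-z])
  tooShort: meetsMinimum("Sh0rt"),      // passes the lookaheads, fails .{8,}
};

console.log(outcomes);
```

Since the lookaheads consume no input, their order does not affect the result; only .{8,} actually advances through the string.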
With special character requirement:
pattern = regex.new('^(?=.*[a-z])(?=.*[A-Z])(?=.*\\d)(?=.*[@$!%*?&]).{8,}$')
IPv4 address:
pattern = regex.new('\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b')
-- Matches: 192.168.1.1, 10.0.0.1
-- With validation (0-255 per octet):
pattern = regex.new('\\b(?:(?:25[0-5]|2[0-4]\\d|[01]?\\d\\d?)\\.){3}(?:25[0-5]|2[0-4]\\d|[01]?\\d\\d?)\\b')
IPv6 address (simplified):
pattern = regex.new('(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}')
-- Matches full IPv6: 2001:0db8:85a3:0000:0000:8a2e:0370:7334
Match opening and closing tags:
pattern = regex.new('<([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*>(.*?)</\\1>')
-- Matches: <div>content</div>, <span class="x">text</span>
-- Captures: tag name (group 1), content (group 2)
Extract tag content:
pattern = regex.new('<[^>]+>(.*?)</[^>]+>')
-- Captures content between any tags
Match self-closing tags:
pattern = regex.new('<[a-zA-Z][a-zA-Z0-9]*\\b[^>]*/>')
-- Matches: <br/>, <img src="x" />
Basic CSV field:
pattern = regex.new('([^,]+),?')
-- Matches fields separated by commas
CSV with quoted fields:
pattern = regex.new('(?:^|,)(?:\"([^\"]*(?:\"\"[^\"]*)*)\"|([^,]*))')
-- Handles: "quoted field", unquoted, "field with ""quotes"""
Extract words:
pattern = regex.new('\\b\\w+\\b')
-- Matches: any word (alphanumeric + underscore)
pattern = regex.new('\\b[A-Za-z]+\\b')
-- Matches: only alphabetic words
Extract words with apostrophes:
pattern = regex.new('\\b[A-Za-z]+(?:\'[A-Za-z]+)?\\b')
-- Matches: don't, it's, can't, etc.
Integer:
pattern = regex.new('-?\\d+')
-- Matches: 123, -456
Floating point:
pattern = regex.new('-?\\d+\\.\\d+')
-- Matches: 123.45, -67.89
-- With optional decimal part:
pattern = regex.new('-?\\d+(?:\\.\\d+)?')
-- Matches: 123, 123.45, -67.89
Scientific notation:
pattern = regex.new('-?\\d+(?:\\.\\d+)?(?:[eE][+-]?\\d+)?')
-- Matches: 1.23e10, -4.5E-6, 123
Trim leading/trailing whitespace:
pattern = regex.new('^\\s+|\\s+$')
-- Use with replace to remove leading/trailing spaces
Collapse multiple spaces:
pattern = regex.new('\\s+')
-- Replace with single space to normalize whitespace
Split on whitespace:
pattern = regex.new('\\s+')
-- Use with split to separate words
Unix/Linux path:
pattern = regex.new('^(/[^/]+)+/?$')
-- Matches: /home/user/file.txt, /usr/local/bin/
Windows path:
pattern = regex.new('^[A-Za-z]:\\\\(?:[^\\\\/:*?\"<>|]+\\\\)*[^\\\\/:*?\"<>|]*$')
-- Matches: C:\Users\Name\file.txt
File extension:
pattern = regex.new('\\.([A-Za-z0-9]+)$')
-- Captures file extension: .txt, .pdf, .jpg
Semantic versioning:
pattern = regex.new('^(0|[1-9]\\d*)\\.(0|[1-9]\\d*)\\.(0|[1-9]\\d*)(?:-((?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\\.(?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\\+([0-9a-zA-Z-]+(?:\\.[0-9a-zA-Z-]+)*))?$')
-- Matches: 1.0.0, 2.1.3, 1.0.0-alpha.1, 1.0.0+build.123
Simple version:
pattern = regex.new('\\d+\\.\\d+(?:\\.\\d+)?')
-- Matches: 1.0, 1.0.5, 2.10.1
When pattern compilation or matching fails, specific error types indicate the nature of the problem. Understanding these errors helps diagnose and fix pattern issues.
These errors occur when compiling a regex pattern:
| Error | Description | Example |
|---|---|---|
| error_escape | Invalid escape sequence | \q (undefined escape), \c (not followed by letter), \x (not followed by two hex digits), \u{GGGG} (invalid hex) |
| error_brack | Mismatched square brackets | [abc, abc], [a[b] (nested) |
| error_paren | Mismatched parentheses | (abc, abc), ((a) (unclosed) |
| error_brace | Mismatched curly braces | a{3, a3}, a{2, (missing closing brace) |
| error_badbrace | Invalid quantifier range | {3,2} (n > m), {-1} (negative), {,5} (missing n) |
| error_range | Invalid character range in class | [z-a] (reversed), [\u0100-\u0010] (start > end) |
| error_backref | Invalid backreference | \9 (group doesn't exist), \k<name> (name doesn't exist) |
| error_modifier | Invalid flag modifier | (?ii:...) (duplicate flag), (?i-i:...) (contradictory) |
| error_operator | Invalid set operator usage | [AB--CD] (mixed operators at same level), !! (reserved double punctuator in class) |
| error_noescape | Character must be escaped | [(] (should be [\(]), [{] (should be [\{]) in character classes |
| error_complement | Invalid negation | [^\p{RGI_Emoji}] (string property in negated class), \P{RGI_Emoji} (string property with \P) |
| error_badrepeat | Quantifier without preceding expression | *abc (starts with quantifier), a** (double quantifier) |
| error_utf8 | Invalid UTF-8 sequence in pattern | Pattern contains invalid UTF-8 bytes, overlong encoding, or code point > U+10FFFF |
-- Invalid: \q is not defined
pattern = regex.new('\\q') -- Error: invalid escape sequence
-- Invalid: \c not followed by letter
pattern = regex.new('\\c5') -- Error: expected A-Z or a-z after \c
-- Invalid: \x not followed by two hex digits
pattern = regex.new('\\xGG') -- Error: expected two hex digits
-- Invalid: code point exceeds maximum
pattern = regex.new('\\u{110000}') -- Error: code point > U+10FFFF
-- Valid alternatives:
pattern = regex.new('q') -- Literal q
pattern = regex.new('\\x71') -- Hex escape for q
pattern = regex.new('\\u0071') -- Unicode escape for q
-- Invalid: unclosed bracket
pattern = regex.new('[abc') -- Error: missing ]
-- Invalid: extra closing bracket
pattern = regex.new('abc]') -- Error: unmatched ]
-- Valid:
pattern = regex.new('[abc]') -- Correct bracket pair
pattern = regex.new('\\]') -- Escaped bracket (literal)
-- Invalid: unclosed parenthesis
pattern = regex.new('(abc') -- Error: missing )
-- Invalid: extra closing parenthesis
pattern = regex.new('abc)') -- Error: unmatched )
-- Valid:
pattern = regex.new('(abc)') -- Correct parenthesis pair
pattern = regex.new('\\(abc\\)') -- Escaped parentheses (literals)
-- Invalid: unclosed brace
pattern = regex.new('a{3') -- Error: missing }
-- Valid:
pattern = regex.new('a{3}') -- Correct quantifier
pattern = regex.new('\\{3\\}') -- Escaped braces (literals)
-- Invalid: n > m in range
pattern = regex.new('a{5,3}') -- Error: 5 > 3
-- Invalid: missing n
pattern = regex.new('a{,5}') -- Error: must specify n
-- Valid:
pattern = regex.new('a{3,5}') -- n ≤ m
pattern = regex.new('a{3,}') -- n or more (no maximum)
pattern = regex.new('a{3}') -- exactly n
-- Invalid: reversed range
pattern = regex.new('[z-a]') -- Error: z (U+007A) > a (U+0061)
-- Invalid: empty range
pattern = regex.new('[\\u0100-\\u0010]') -- Error: start > end
-- Valid:
pattern = regex.new('[a-z]') -- Correct range
pattern = regex.new('[z]') -- Single character (no range)
-- Invalid: group doesn't exist
pattern = regex.new('\\5') -- Error: no group 5
-- Invalid: named group doesn't exist
pattern = regex.new('\\k<missing>') -- Error: no group named 'missing'
-- Valid:
pattern = regex.new('(a)\\1') -- Backreference to group 1
pattern = regex.new('(?<x>a)\\k<x>') -- Named backreference
-- Invalid: duplicate flag
pattern = regex.new('(?ii:abc)') -- Error: 'i' appears twice
-- Invalid: contradictory flags
pattern = regex.new('(?i-i:abc)') -- Error: both +i and -i
-- Valid:
pattern = regex.new('(?i:abc)') -- Single flag
pattern = regex.new('(?im:abc)') -- Multiple different flags
pattern = regex.new('(?i-m:abc)') -- Enable and disable flags
-- Invalid: mixed operators at same level
pattern = regex.new('[AB--CD]') -- Error: union (AB) then subtraction
-- Invalid: reserved double punctuator
pattern = regex.new('[a-z!!]') -- Error: !! is reserved
-- Valid:
pattern = regex.new('[[AB]--[CD]]') -- Nested classes
pattern = regex.new('[A[B--C]D]') -- Operator in nested level
pattern = regex.new('[a-z\\!\\!]') -- Escaped (two separate !)
-- Invalid: ( must be escaped in character class
pattern = regex.new('[(]') -- Error: must escape (
-- Invalid: { must be escaped
pattern = regex.new('[{]') -- Error: must escape {
-- Valid:
pattern = regex.new('[\\(]') -- Escaped (
pattern = regex.new('[\\{\\}]') -- Escaped braces
-- Invalid: string property in negated class
pattern = regex.new('[^\\p{RGI_Emoji}]') -- Error: cannot negate string property
-- Invalid: string property with \P
pattern = regex.new('\\P{RGI_Emoji}') -- Error: \P doesn't support string properties
-- Valid:
pattern = regex.new('[\\p{RGI_Emoji}]') -- String property in positive class
pattern = regex.new('\\P{Emoji}') -- Character property (not string)
pattern = regex.new('[^\\p{Emoji}]') -- Character property negated
-- Invalid: quantifier at start
pattern = regex.new('*abc') -- Error: nothing to repeat
-- Invalid: double quantifier
pattern = regex.new('a**') -- Error: quantifier on quantifier
-- Valid:
pattern = regex.new('a*bc') -- Quantifier after character
pattern = regex.new('\\*abc') -- Escaped * (literal)
-- Invalid UTF-8 in pattern
-- (This typically occurs when pattern strings contain invalid byte sequences)
-- Invalid: overlong encoding
pattern = regex.new('\\xC0\\xB0') -- Error: overlong form of U+0030
-- Valid:
pattern = regex.new('\\x30') -- Shortest form
pattern = regex.new('\\u0030') -- Unicode escape
Tiri:
try
pattern = regex.new('[invalid')
except ex
print('Pattern compilation failed: ' .. ex.message)
success
-- Use pattern
end
- Test patterns incrementally: Build complex patterns step by step, testing each addition
- Use online regex testers: Many tools visualise patterns and highlight errors (ensure they support ECMAScript syntax)
- Check bracket matching: Count opening and closing brackets/parentheses/braces
- Validate escape sequences: Ensure all \ sequences are valid
- Review operator precedence: Verify set operations are properly nested
- Examine Unicode sequences: Confirm \u{...} values are valid code points
- Test with edge cases: Try empty strings, very long strings, and strings with special characters
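Edge-case testing from the list above lends itself to a small table of inputs and expected results. The sketch below is one possible shape for such a check, written against TypeScript's built-in RegExp rather than the Kōtuku Regex class. The Case type, the failingCases helper, and the sample inputs are all hypothetical.

```typescript
// Hypothetical table-driven check; names are illustrative, not Kōtuku API.
type Case = { input: string; expected: boolean };

// Returns the inputs whose test result differs from the expectation.
function failingCases(pattern: RegExp, cases: Case[]): string[] {
  return cases
    .filter((c) => pattern.test(c.input) !== c.expected)
    .map((c) => c.input);
}

const edgeCases: Case[] = [
  { input: "", expected: false },              // empty string
  { input: "123", expected: true },
  { input: "12a", expected: false },           // trailing non-digit
  { input: "9".repeat(10000), expected: true }, // very long input
  { input: "1.5", expected: false },           // special character
];

const anchoredFailures = failingCases(/^\d+$/, edgeCases);
const unanchoredFailures = failingCases(/\d+/, edgeCases);

console.log(anchoredFailures, unanchoredFailures);
```

Running the same table against the unanchored form flags "12a" and "1.5", which is the kind of mistake a handful of edge cases tends to expose.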
Forgetting to escape special characters:
-- Wrong: . matches any character
pattern = regex.new('file.txt')
-- Matches: "file.txt", "file?txt", "fileXtxt"
-- Correct: \. matches literal dot
pattern = regex.new('file\\.txt')
-- Matches: "file.txt" only
Incorrect bracket nesting:
-- Wrong: brackets don't nest this way
pattern = regex.new('[[a-z]') -- Error
-- Correct: nest with operators
pattern = regex.new('[[a-m][n-z]]') -- Union of two ranges
Quantifier on quantifier:
-- Wrong: double quantifier
pattern = regex.new('a*+') -- Error
-- Correct: quantify group
pattern = regex.new('(a*)+')
This manual has covered the complete regular expression syntax and features supported by Kōtuku:
- Character matching including Unicode escapes and special characters
- Character classes with ranges, predefined classes, and set operations
- Quantifiers for controlling repetition (greedy and non-greedy)
- Groups and backreferences for capturing and reusing matched text
- Assertions for zero-width matching conditions
- Unicode support with full UTF-8 and property matching
- Flags for controlling compilation and matching behaviour
- Performance considerations for efficient pattern usage
- Common patterns for practical applications
- Error handling for debugging pattern issues
For API documentation on the Regex class and its methods, please refer to the Regex module documentation in the Kōtuku API reference.
This manual documents the regex implementation as of 2026. For updates and the latest specification, refer to the ECMAScript Specification.