Regular expressions are a computer science concept where simple patterns describe the format of text. Pattern matching is the process of applying these patterns to actual text to look for matches. Most modern regular expression facilities are more powerful than traditional regular expressions due to the influence of languages such as Perl, but the short-hand term regex has stuck and continues to mean "regular expression-like pattern matching". In Perl 6, although the specific syntax used to describe the patterns is different from PCRE (Perl Compatible Regular Expressions) and POSIX (Portable Operating System Interface for Unix; see IEEE standard 1003.1-2001), we continue to call them regex.

A common writing error is to duplicate a word by accident. It is hard to catch such errors by re-reading your own text, but Perl can do it for you using regex:
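A minimal sketch of such a check, assuming a hypothetical sample sentence (the variable names are illustrative only); the individual parts of the pattern are explained later in this chapter:

```raku
# Hypothetical sample text containing an accidental repetition.
my $text = 'the quick brown fox jumped over the the lazy dog';

# « and » are word boundaries; $0 refers back to the first capture.
my $found = $text ~~ m/ « (\w+) \W+ $0 » /;

if $found {
    say "Found '" ~ $found[0] ~ "' twice in a row";
}
```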

The simplest case of a regex is a constant string. Matching a string against that regex searches for that string:

The construct m/ ... / builds a regex. A regex on the right hand side of the ~~ smart match operator applies against the string on the left hand side. By default, whitespace inside the regex is irrelevant for the matching, so writing the regex as m/ perl /, m/perl/ or m/ p e rl/ all produce the exact same semantics--although the first way is probably the most readable.
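As a small illustration, assuming the sample string 'properly' (chosen here only for the example), both spellings of the pattern below find the same substring:

```raku
# A constant-string regex: search for the substring "perl".
my $matched = 'properly' ~~ m/ perl /;
say "'properly' contains 'perl'" if $matched;

# Whitespace inside the pattern is not significant,
# so this is the same regex written more densely.
my $same = 'properly' ~~ m/perl/;
```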

Only word characters (letters, digits, and the underscore) cause an exact substring search. All other characters may have a special meaning. If you want to search for a comma, an asterisk, or another non-word character, you must quote or escape it (to search for a literal string without using the pattern matching features of regex, consider using index or rindex instead):
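A short sketch of both options, assuming a made-up sample string; the last line shows the plain substring search with index for comparison:

```raku
my $str = "I'm *very* happy";

# Quoting: everything inside the quotes is taken literally.
my $quoted  = $str ~~ m/ '*very*' /;

# Escaping: a backslash removes the special meaning of each asterisk.
my $escaped = $str ~~ m/ \* very \* /;

# Plain substring search without any regex features.
my $pos = $str.index('*very*');
```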

Searching for literal strings gets boring pretty quickly. Regex support special (also called metasyntactic) characters. The dot (.) matches a single, arbitrary character:

This prints:

The dot matched an l, r, and n, but it will also match a space in the sentence the spectroscope lacks resolution--regexes ignore word boundaries by default.
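A sketch consistent with the description above, assuming a hypothetical four-word list (the list and the variable names are illustrative, not necessarily the original example):

```raku
my @words = <spell superlative openly stuff>;
my @report;

for @words -> $w {
    if $w ~~ m/ pe . l / {
        # $/ holds the match object; ~ turns it into the matched text.
        @report.push: $w ~ ' contains ' ~ $/;
    }
    else {
        @report.push: 'no match for ' ~ $w;
    }
}

.say for @report;
# spell contains pell
# superlative contains perl
# openly contains penl
# no match for stuff
```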

The special variable $/ stores the match object, which allows you to inspect the matched text.

Suppose you want to solve a crossword puzzle. You have a word list and want to find words containing pe, then an arbitrary letter, and then an l (but not a space, as your puzzle has extra markers for those). The appropriate regex for that is m/pe \w l/. The \w control sequence stands for a "Word" character--a letter, digit, or an underscore. This chapter's example uses \w to build the definition of a "word".

Several other common control sequences each match a single character; you can find a list of those in regexBackslash.

Invert the sense of each of these backslash sequences by uppercasing its letter: \W matches a character that's not a word character and \N matches a single character that's not a newline.

These matches extend beyond the ASCII range--\d matches Latin, Arabic-Indic, Devanagari and other digits, \s matches non-breaking whitespace, and so on. These character classes follow the Unicode definition of what is a letter, a number, and so on.
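The following sketch exercises these sequences on a few made-up strings. The non-ASCII digit is written as an escape (U+0663, ARABIC-INDIC DIGIT THREE) so the example does not depend on the file encoding:

```raku
# \w matches a letter, digit, or underscore ...
my $letter  = 'spell' ~~ m/ pe \w l /;   # matches "pell"

# ... but not a space.
my $space   = 'pe l'  ~~ m/ pe \w l /;   # no match

# The uppercase form inverts the meaning.
my $nonword = 'a-b'   ~~ m/ \W /;        # matches "-"

# \d is not limited to ASCII digits.
my $digit   = "\x[0663]" ~~ m/ \d /;
```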

To define your own custom character classes, list the appropriate characters inside nested angle and square brackets <[ ... ]>:
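For instance, a class of vowels might look like this (the sample string is an assumption made for the example):

```raku
my $str = 'rhythm and blues';

# <[aeiou]> matches any single character listed in the class.
my $vowel = $str ~~ m/ <[aeiou]> /;

say "'$str' contains a vowel" if $vowel;
```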

Rather than listing each character in the character class individually, you may specify a range of characters by placing the range operator .. between the beginning and ending characters:
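A brief sketch with a hypothetical input string:

```raku
# <[0..9]> is shorthand for listing all ten ASCII digits.
my $first-digit = 'room 42' ~~ m/ <[0..9]> /;

# A quantifier applies to the class as a whole.
my $number      = 'room 42' ~~ m/ <[0..9]>+ /;
```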

You may add characters to or subtract characters from classes with the + and - operators:

The negated character class is a special application of this idea.
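The three forms can be sketched as follows, using made-up strings:

```raku
# Union: a lowercase letter or a digit.
my $alnum     = '#7' ~~ m/ <[a..z]+[0..9]> /;

# Difference: a lowercase letter that is not a vowel.
my $consonant = 'aeiouxy' ~~ m/ <[a..z]-[aeiou]> /;

# Negation: any character except the ones listed.
my $not-a     = 'aab' ~~ m/ <-[a]> /;
```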

Inside a character class, non-word characters do not need to be escaped, and generally lose their special meaning. So /<[+.*]> / matches a plus sign, a dot or an asterisk. The only exceptions to that are the backslash, square brackets and the dash -, which need to be escaped with a backslash:
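A sketch of the escaping rules, assuming two illustrative strings:

```raku
my $str = 'A character [b] inside brackets';

# Square brackets must be escaped inside the class.
my $bracketed = $str ~~ m/ '[' <-[ \[ \] ]> ']' /;

# So must the dash; plus, dot, and asterisk need no escape here.
my $operator  = '3-4' ~~ m/ <[+\-.*]> /;
```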

A quantifier specifies how often something has to occur. A question mark ? makes the preceding unit (be it a letter, a character class, or something more complicated) optional, meaning it can be present either zero or one times. m/ho u? se/ matches either house or hose. You can also write the regex as m/hou?se/ without any spaces, and the ? will still quantify only the u.

The asterisk * stands for zero or more occurrences, so m/z\w*o/ can match zo, zoo, zero and so on. The plus + stands for one or more occurrences. \w+ usually matches what you might consider a word (though it matches only the first three characters of isn't, because ' isn't a word character).
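The three quantifiers can be tried on the strings mentioned above; this is a sketch, and the variable names are chosen only for illustration:

```raku
# ? : the u is optional.
my $house = 'house' ~~ m/ ho u? se /;
my $hose  = 'hose'  ~~ m/ ho u? se /;

# * : zero or more word characters between z and o.
my $zero  = 'zero'  ~~ m/ z\w*o /;

# + : one or more word characters; the apostrophe ends the run.
my $isnt  = "isn't" ~~ m/ \w+ /;
```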

The most general quantifier is **. When followed by a number, it matches that many times. When followed by a range, it can match any number of times that the range allows:
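A sketch with an arbitrary date-like string and a run of letters:

```raku
# Exactly four digits, a dash, two digits, a dash, two digits.
my $date  = 'due 1999-12-31' ~~ m/ \d ** 4 '-' \d\d '-' \d\d /;

# Between one and five lowercase letters; greedy, so it takes five.
my $short = 'abcdefgh' ~~ m/ <[a..z]> ** 1..5 /;
```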

One can specify a separator with % after the quantifier:

   '1,2,3' ~~ / \d+ % ',' /

The separator is matched between two occurrences of the quantified regex. The separator can itself be a regex.
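The sketch below extends the one-line example above. Note that \d+ % ',' quantifies single digits; to separate multi-digit numbers, the quantified unit has to be a group (the input strings are assumptions):

```raku
# Single digits separated by commas.
my $digits  = 'values: 1,2,3;' ~~ m/ \d+ % ',' /;

# The separator may itself be a regex: a comma plus optional whitespace.
my $spaced  = '1, 2,3' ~~ m/ \d+ % [ ',' \s* ] /;

# Whole numbers separated by commas need a group as the quantified unit.
my $numbers = '10,20,30' ~~ m/ [\d+]+ % ',' /;
```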

If a quantifier has several ways to match, Perl will choose the longest one. This is greedy matching. Appending a question mark to a quantifier makes it non-greedy. (The non-greedy general quantifier is $thing **? $count, so the question mark goes directly after the second asterisk.)

For example, you can parse HTML very badly (using a proper stateful parser is always more accurate) with the code:
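A sketch contrasting the greedy and non-greedy forms on a hypothetical HTML fragment:

```raku
my $html = '<p>A paragraph</p> <p>And a second one</p>';

# Greedy: .* runs to the last closing tag in the string.
my $greedy = $html ~~ m/ '<p>' .*  '</p>' /;

# Non-greedy: .*? stops at the first closing tag.
my $frugal = $html ~~ m/ '<p>' .*? '</p>' /;

say 'greedy: ', ~$greedy;
say 'frugal: ', ~$frugal;
```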

To apply a quantifier to more than just one character or character class, group items with square brackets:

Here \w+ matches a word, and [\w+]+ % [\,\s*] matches at least one word, where several words are separated by a comma and an arbitrary amount of whitespace.
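One way to use this grouping in a complete pattern, assuming a hypothetical ingredient list that ends with "and" plus a final word:

```raku
my $ingredients = 'milk, flour, eggs and sugar';

# [\w+]+ % [\,\s*] : words separated by a comma and optional whitespace,
# followed by "and" and one last word.
my $list = $ingredients ~~ m/ [\w+]+ % [\,\s*] \s* 'and' \s* \w+ /;

say 'looks like a list' if $list;
```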

Separate alternations--parts of a regex of which any can match--with vertical bars. One vertical bar between multiple parts of a regex means that the alternatives are tried in parallel and the longest matching alternative wins. Two bars make the regex engine try each alternative in order and the first matching alternative wins.
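The difference between the two operators shows when one alternative is a prefix of another; a sketch with an arbitrary test string:

```raku
# One bar: the longest matching alternative wins.
my $longest = 'foobar' ~~ m/ foo |  foobar /;

# Two bars: the first alternative that matches wins.
my $first   = 'foobar' ~~ m/ foo || foobar /;
```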


So far every regex could match anywhere within a string. Often it is useful to limit the match to the start or end of a string or line or to word boundaries. A single caret ^ anchors the regex to the start of the string and a dollar sign $ to the end. m/ ^a / matches strings beginning with an a, and m/ ^ a $ / matches strings that consist only of an a.
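A quick sketch of both anchors on a few sample strings:

```raku
# ^ anchors to the start of the string.
my $starts     = ?('apple'  ~~ m/ ^a /);
my $not-start  = ?('banana' ~~ m/ ^a /);

# ^ and $ together require the whole string to match.
my $only-a     = ?('a'  ~~ m/ ^ a $ /);
my $not-only-a = ?('aa' ~~ m/ ^ a $ /);
```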


Regex can be very useful for extracting information too. Surrounding part of a regex with round brackets (aka parentheses) (...) makes Perl capture the string it matches. The string matched by the first group of parentheses is available in $/[0], the second in $/[1], etc. $/ acts as an array containing the captures from each parentheses group:
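A minimal sketch, assuming a made-up string with two groups of digits:

```raku
my $m = 'phone: 555-0199' ~~ m/ (\d ** 3) '-' (\d ** 4) /;

if $m {
    say 'prefix: ', ~$m[0];   # first pair of parentheses
    say 'line:   ', ~$m[1];   # second pair of parentheses
}
```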

If you quantify a capture, the corresponding entry in the match object is a list of other match objects:

This prints:

The first capture, (\w+), was quantified, so $/[0] contains a list of words. The code calls .join to turn it into a string. Regardless of how many times the first capture matches (and how many elements are in $/[0]), the second capture is still available in $/[1].
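A sketch matching this description, assuming a hypothetical ingredient list (the exact string is an assumption):

```raku
my $ingredients = 'eggs, milk, sugar and flour';

my $m = $ingredients ~~ m/ (\w+)+ % [\,\s*] \s* 'and' \s* (\w+) /;

if $m {
    # $m[0] is a list of match objects, one per repetition.
    say 'list: ', $m[0].join(' | ');
    # $m[1] is a single match object.
    say 'end:  ', ~$m[1];
}
# list: eggs | milk | sugar
# end:  flour
```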

As a shortcut, $/[0] is also available under the name $0, $/[1] as $1, and so on. These aliases are also available inside the regex. This allows you to write a regex that detects that common error of duplicated words, just like the example at the beginning of this chapter:

The regex first anchors to a left word boundary with « so that it doesn't match partial duplication of words. Next, the regex captures a word ((\w+)), followed by at least one non-word character \W+. This implies a right word boundary, so there is no need to use an explicit boundary. Then it matches the previous capture followed by a right word boundary.

Without the first word boundary anchor, the regex would for example match strand and beach or lathe the table leg. Without the last word boundary anchor it would also match the theory.
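The effect of each anchor can be checked by leaving it out; a sketch using two of the phrases mentioned above:

```raku
# Full pattern: both boundaries present, so neither phrase matches.
my $strict-1 = 'strand and beach' ~~ m/ « (\w+) \W+ $0 » /;
my $strict-2 = 'the theory'       ~~ m/ « (\w+) \W+ $0 » /;

# Without the leading boundary, the "and" inside "strand" counts.
my $no-left  = 'strand and beach' ~~ m/ (\w+) \W+ $0 » /;

# Without the trailing boundary, the "the" inside "theory" counts.
my $no-right = 'the theory'       ~~ m/ « (\w+) \W+ $0 /;
```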

Named regexes

You can declare regexes just like subroutines--and even name them. Suppose you found the example at the beginning of this chapter useful and want to make it available easily. Suppose also you want to extend it to handle contractions such as doesn't or isn't:

This code introduces a regex named word, which matches at least one word character, optionally followed by a single quote and some more word characters. Another regex called dup (short for duplicate) contains a word boundary anchor.

Within a regex, the syntax <&word> locates the regex word within the current lexical scope and matches against the regex. The <name=&regex> syntax creates a capture named name, which records what &regex matched in the match object.

In this example, dup calls the word regex, then matches at least one non-word character, and then matches the same string as previously matched by the regex word. It ends with another word boundary. The syntax for this backreference is a dollar sign followed by the name of the capture in angle brackets. In grammars (see grammars), <word> looks up a regex named word in the current grammar and parent grammars, and creates a capture of the same name.

Within the if block, $<dup> is short for $/{'dup'}. It accesses the match object that the regex dup produced. dup also has a subrule called word. The match object produced from that call is accessible as $<dup><word>.
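A sketch of the two named regexes described above, applied to a hypothetical sentence with a repeated contraction (the sample string and the $twice variable are assumptions):

```raku
# A word, optionally followed by an apostrophe and more word characters.
my regex word { \w+ [ \' \w+ ]? }

# A word, some non-word characters, and the same word again.
my regex dup  { « <word=&word> \W+ $<word> » }

my $s     = "This doesn't doesn't look right";
my $twice = '';

if $s ~~ m/ <dup=&dup> / {
    $twice = ~$<dup><word>;
    say "Found '" ~ $twice ~ "' twice in a row";
}
```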

Named regexes make it easy to organize complex regexes by building them up from smaller pieces.


The previous example to match a list of words was:

This works, but the repeated "I don't care about whitespace" units are clumsy. The desire to allow whitespace anywhere in a string is common. Perl 6 regexes allow this through the use of the :sigspace modifier (shortened to :s):

This modifier allows optional whitespace in the text wherever one or more whitespace characters appear in the pattern. It's even a bit cleverer than that: between two word characters whitespace is mandatory. The regex does not match the string eggs, milk, sugarandflour.

The :ignorecase or :i modifier makes the regex insensitive to upper and lower case, so m/ :i perl / matches perl, PerL, and PERL (though who names a programming language in all uppercase letters?).
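A reduced sketch of this behaviour, using short patterns rather than the full word-list regex (the strings are illustrative):

```raku
# Whitespace in the pattern becomes "whitespace allowed here".
my $spaced   = ?('sugar and flour' ~~ m:s/ \w+ and \w+ /);

# Between two word characters the whitespace is mandatory.
my $squashed = ?('sugarandflour'   ~~ m:s/ \w+ and \w+ /);

# Next to a non-word character such as a comma it is optional.
my $tight    = ?('eggs,milk'   ~~ m:s/ \w+ \, \w+ /);
my $loose    = ?('eggs , milk' ~~ m:s/ \w+ \, \w+ /);
```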
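For example (the sample strings are arbitrary):

```raku
my $mixed = 'PerL is fun' ~~ m/ :i perl /;
my $upper = 'PERL'        ~~ m/ :i perl /;

# Without the modifier, case matters.
my $exact = 'PERL'        ~~ m/ perl /;
```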

Backtracking control

In the course of matching a regex against a string, the regex engine may reach a point where an alternation has matched a particular branch or a quantifier has greedily matched all it can, but the final portion of the regex fails to match. In this case, the regex engine backs up and attempts to match another alternative or matches one fewer character of the quantified portion to see if the overall regex succeeds. This process of failing and trying again is backtracking.

When matching m/\w+ 'en'/ against the string oxen, the \w+ group first matches the whole string because of the greediness of +, but then the en literal at the end can't match anything. \w+ gives up one character to match oxe. en still can't match, so the \w+ group again gives up one character and now matches ox. The en literal can now match the last two characters of the string, and the overall match succeeds.

While backtracking is often useful and convenient, it can also be slow and confusing. A colon : switches off backtracking for the previous quantifier or alternation. m/ \w+: 'en'/ can never match any string, because the \w+ always eats up all word characters and never releases them.
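Both behaviours can be observed on the string from the example above:

```raku
# With backtracking, \w+ gives characters back until 'en' fits.
my $with    = 'oxen' ~~ m/ \w+  'en' /;

# With the colon, \w+ keeps everything it matched, so 'en' never fits.
my $without = 'oxen' ~~ m/ \w+: 'en' /;
```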

The :ratchet modifier disables backtracking for a whole regex, which is often desirable in a small regex called often from other regexes. The duplicate word search regex had to anchor the regex to word boundaries, because \w+ would allow matching only part of a word. Disabling backtracking makes \w+ always match a full word:

The effect of :ratchet applies only to the regex in which it appears. The outer regex will still backtrack, so it can retry the regex word at a different starting position.

The regex { :ratchet ... } pattern is so common that it has its own shortcut: token { ... }. An idiomatic duplicate word searcher might be:
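The equivalence can be sketched with the oxen pattern from the backtracking discussion rather than the duplicate-word search; the names loose, strict, and tok are hypothetical:

```raku
my regex loose  { \w+ 'en' }            # backtracks
my regex strict { :ratchet \w+ 'en' }   # never backtracks
my token tok    { \w+ 'en' }            # shorthand for the line above

my $loose-ok  = ?('oxen' ~~ m/ <&loose>  /);
my $strict-ok = ?('oxen' ~~ m/ <&strict> /);
my $tok-ok    = ?('oxen' ~~ m/ <&tok>    /);
```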

A token with the :sigspace modifier is a rule:
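A small sketch, assuming a hypothetical rule named both:

```raku
# Whitespace in the rule body is significant, and nothing backtracks.
my rule both { \w+ and \w+ }

my $spaced   = ?('sugar and flour' ~~ m/ <&both> /);
my $squashed = ?('sugarandflour'   ~~ m/ <&both> /);
```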


Regexes are also good for data manipulation. The subst method matches a regex against a string. When subst matches, it substitutes the matched portion of the string with its second operand:

By default, subst performs a single match and stops. The :g modifier tells the substitution to work globally to replace every possible match.

Note the use of rx/ ... / rather than m/ ... / to construct the regex. The former constructs a regex object. The latter constructs the regex object and immediately matches it against the topic variable $_. Using m/ ... / in the call to subst would create a match object and pass it as the first argument, rather than the regex itself.
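A sketch of both the single and the global substitution, assuming a sample string with runs of extra spaces:

```raku
my $spacey = 'with    many  superfluous   spaces';

# Replace only the first run of whitespace.
my $once = $spacey.subst(rx/ \s+ /, ' ');

# :g replaces every run of whitespace.
my $all  = $spacey.subst(rx/ \s+ /, ' ', :g);

say $all;   # with many superfluous spaces
```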

Other Regex Features

Sometimes you want to call other regexes, but don't want them to capture the matched text. When parsing a programming language you might discard whitespace characters and comments. You can achieve that by calling the regex as <.otherrule>.

If you use the :sigspace modifier, every continuous piece of whitespace calls the built-in rule <.ws>. This use of a rule rather than a character class allows you to define your own version of whitespace characters (see grammars).

Sometimes you just want to peek ahead to check if the next characters fulfill some properties without actually consuming them. This is common in substitutions. In normal English text, you always place a whitespace after a comma. If somebody forgets to add that whitespace, a regex can clean up after the lazy writer:

The word character after the comma is not part of the match, because it is in a look-ahead introduced by <?before ... >. The leading question mark indicates a zero-width assertion: a rule that never consumes characters from the matched string. You can turn any call to a subrule into a zero-width assertion. The built-in token <alpha> matches an alphabetic character, so you can rewrite this example as:
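Both spellings of the look-ahead can be sketched as follows, assuming a hypothetical string with missing spaces after its commas:

```raku
my $str = 'milk,flour,sugar and eggs';

# Only the comma is replaced; the following character is merely checked.
my $fixed = $str.subst(/ ',' <?before \w> /, ', ', :g);

# The same idea, using the built-in <alpha> as a zero-width assertion.
my $same  = $str.subst(/ ',' <?alpha> /, ', ', :g);

say $fixed;   # milk, flour, sugar and eggs
```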

A leading exclamation mark negates the meaning, such that the lookahead must not find the regex fragment. Another variant is:

You can also look behind to assert that the string only matches after another regex fragment. This assertion is <?after>. You can write the equivalent of many built-in anchors with look-ahead and look-behind assertions, though they won't be as efficient.
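A sketch of a negated look-ahead and a look-behind on made-up strings:

```raku
my $untidy = 'milk,flour, sugar';

# Replace a comma only when it is not followed by whitespace.
my $tidy  = $untidy.subst(/ ',' <!space> /, ', ', :g);

# Match a word character only when a comma comes directly before it.
my $after = 'a,b' ~~ m/ <?after ','> \w /;
```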

Match objects

Every regex match returns an object of type Match. In boolean context, a match object returns True for successful matches and False for failed ones. Most properties are only interesting after successful matches.

The orig method returns the string that was matched against. The from and to methods return the positions of the start and end points of the match.

In the previous example, the line-and-column function determines the line number in which the match occurred by extracting the string up to the match position ($m.orig.substr(0, $m.from)), splitting it by newlines, and counting the elements. It calculates the column by searching backwards from the match position and calculating the difference to the match position.
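A sketch of such a function, following the description above. The sample text and the fallback of -1 for "no newline found" are assumptions of this sketch:

```raku
sub line-and-column(Match $m) {
    # Lines: split everything before the match at newlines and count.
    my $line = $m.orig.substr(0, $m.from).split("\n").elems;

    # Columns: distance from the last newline before the match.
    # rindex finds nothing on the first line, so fall back to -1.
    my $last-newline = $m.orig.rindex("\n", $m.from) // -1;
    my $column       = $m.from - $last-newline;

    return ($line, $column);
}

my $text = "first line\nsecond line with a typo typo here";
my $m    = $text ~~ m/ typo /;

my ($line, $column) = line-and-column($m);
say "match from {$m.from} to {$m.to}, line $line, column $column";
```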

Using a match object as an array yields access to the positional captures. Using it as a hash reveals the named captures. In the previous example, $<dup> is a shortcut for $/<dup> or $/{ 'dup' }. These captures are again Match objects, so match objects are really trees of matches.

The caps method returns all captures, named and positional, in the order in which their matched text appears in the source string. The return value is a list of Pair objects, the keys of which are the names or numbers of the capture and the values the corresponding Match objects.

In this case the captures occur in the same order as they are in the regex, but quantifiers can change that. Even so, $/.caps follows the ordering of the string, not of the regex. Any parts of the string which match but not as part of captures will not appear in the values that caps returns.
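A minimal sketch with two positional captures around a named one (the string 'abc' is just an example):

```raku
my @summary;

if 'abc' ~~ m/ (.) <alpha> (.) / {
    for $/.caps -> $pair {
        @summary.push: $pair.key ~ ' => ' ~ $pair.value;
    }
}

.say for @summary;
# 0 => a
# alpha => b
# 1 => c
```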

To access the non-captured parts too, use $/.chunks instead. It returns both the captured and the non-captured part of the matched string, in the same format as caps, but with a tilde ~ as key. If there are no overlapping captures (as occurs from look-around assertions), the concatenation of all the pair values that chunks returns is the same as the matched part of the string.
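A sketch with an uncaptured comma between two captures (the input string is an assumption):

```raku
my $m = 'a,b' ~~ m/ (\w) ',' (\w) /;

# Keys: capture numbers, and ~ for text outside any capture.
my @keys   = $m.chunks.map({ ~.key });

# Joining all values reproduces the matched text.
my $joined = $m.chunks.map({ ~.value }).join;

say @keys.join(' ');   # 0 ~ 1
say $joined;           # a,b
```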

