# Flags and Performance

In earlier examples we occasionally used flags that modify the behavior of regular expressions.  In this lesson we look at them more systematically.

Another issue you may encounter is poor performance of regular expressions against long strings.  The examples in these lessons are small, but in principle you may run (complex) regular expressions against strings that are tens or hundreds of megabytes in size.

In [1]:
from src.setup import *

## Catastrophic Backtracking

The engine underneath a regular expression module can make a large asymptotic difference in performance.  The story here is complicated and involves a lot of computer science and math; moreover, providing better bounds on "pathological" cases sometimes worsens typical behavior on "normal" cases.  The documentation of the standard library `re` module mentions the third-party `regex` module, which both has enhanced functionality (some of it more experimental) and uses a different underlying algorithm that handles the "monsters" vastly better.

A contrived example below shows how rephrasing a regular expression can make a large difference.  We construct two strings, each a run of the letter 'a' terminated by either 'Y' or 'Z'.  The tricky part arises when different subpatterns can match overlapping sets of characters.  That in itself is perfectly reasonable and not uncommon.

In [2]:
success = 'a'*30 + 'Z'
failure = 'a'*30 + 'Y'

pat1 = re.compile(r'[a-z]+Z')

We might first try to match "a string of lowercase letters followed by 'Z'".  This runs quickly both in the success case and in the case that fails only at the last character.

In [3]:
%%time
re.match(pat1, success) or "No Match!"

CPU times: user 8 µs, sys: 2 µs, total: 10 µs
Wall time: 11.7 µs


<re.Match object; span=(0, 31), match='aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZ'>

In [4]:
%%time
re.match(pat1, failure) or "No Match!"

CPU times: user 12 µs, sys: 2 µs, total: 14 µs
Wall time: 17.9 µs


'No Match!'

For a next variation, we want to match:

* a string of (at least one) vowels
* followed by one or more of the first five letters of the alphabet
* the previous two subpatterns, as a pair, occurring one or more times
* ending with a 'Z'

In [5]:
pat2 = re.compile(r'([aeiou]+[abcde]+)+Z')
# Example match: 'oucieddabcdZ'
# Example failure: 'oucieddaklZ'

In [6]:
%%time
re.match(pat2, success)

CPU times: user 15 µs, sys: 0 ns, total: 15 µs
Wall time: 18.6 µs


<re.Match object; span=(0, 31), match='aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZ'>

In [7]:
%%time
re.match(pat2, failure) or "No Match!"

CPU times: user 47.7 s, sys: 110 ms, total: 47.8 s
Wall time: 47.5 s


'No Match!'

## What Went Wrong?

We started at a couple tens of microseconds for either the success or the failure case.  With the second pattern, the success case takes a similar time.  But the failure case now takes tens of **seconds**, i.e. roughly a million times worse.  It gets worse than that, though: the complexity is exponential; in other words, the worst case grows at a rate of $O(2^N)$ (where $N$ is the length of the target string).  For example, the analogous failure with 35 'a's takes 20 minutes, and failing to match ten thousand 'a's would take longer than the lifespan of the universe.
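The growth rate can be observed on much shorter targets without waiting for minutes.  The following is a minimal sketch, not part of the original lesson; the helper name `time_failure` is made up for illustration, and the absolute timings will differ by machine and Python version.  On a backtracking engine, each pair of extra 'a's should cost roughly four times as much.

```python
import re
import time

# Same pattern as pat2 in the lesson
pat = re.compile(r'([aeiou]+[abcde]+)+Z')

def time_failure(n: int) -> float:
    """Seconds needed to discover that n 'a's followed by 'Y' do not match.

    Hypothetical helper, for illustration only.
    """
    target = 'a' * n + 'Y'
    start = time.perf_counter()
    assert pat.match(target) is None
    return time.perf_counter() - start

# Keep n small: the cost roughly doubles with every additional 'a'
timings = {n: time_failure(n) for n in range(12, 21, 2)}
for n, seconds in timings.items():
    print(f"{n:>2} 'a's: {seconds * 1e3:10.3f} ms")
```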

The problem is that the regular expression engine (depending on its implementation) needs to consider every possible way of dividing the run of 'a's between the adjacent subpatterns, and every possible number of repetitions of the group, before it can conclude that there is no 'Z' after the repeated group.
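To get a feel for how many combinations that is, we can count the distinct ways a run of `n` 'a's can be divided among repetitions of the group `([aeiou]+[abcde]+)`.  This is only a rough sketch under simplifying assumptions: the function `count_splits` is hypothetical, it counts complete divisions only, and the number of steps a real engine performs will differ.  It does show the doubling with each extra character.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_splits(n: int) -> int:
    """Count ways to cover n 'a's with one or more ([aeiou]+[abcde]+) groups.

    Hypothetical helper: an illustration of the search space, not a model
    of the exact work done by the `re` engine.
    """
    total = 0
    # The first group consumes `used` characters: i for [aeiou]+ and
    # j for [abcde]+, with i >= 1, j >= 1 and i + j == used.
    for used in range(2, n + 1):
        ways_this_group = used - 1      # possible values of i
        rest = n - used
        if rest == 0:
            total += ways_this_group    # this group is the last one
        else:
            total += ways_this_group * count_splits(rest)
    return total

for n in (10, 20, 30):
    print(n, count_splits(n))
# 10 256
# 20 262144
# 30 268435456
```

For the 30-character target used above, that is 2**28 candidate divisions, each of which ends without a 'Z'.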

We can fix this in several different ways, but finding one for your particular problem can be something of a dark art.

In this example, we might look ahead for something much more general in order to reach failures quickly.  Anything that could match must consist of one or more generic word characters followed by a 'Z'.  This allows targets with no 'Z' character at all to fail immediately, without recursive backtracking.

In [8]:
pat3 = re.compile(r'(?=\w+Z)([aeiou]+[abcde]+)+Z')

In [9]:
%%time
re.match(pat3, success) or "No Match!"

CPU times: user 20 µs, sys: 0 ns, total: 20 µs
Wall time: 25.7 µs


<re.Match object; span=(0, 31), match='aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaZ'>

In [10]:
%%time
re.match(pat3, failure) or "No Match!"

CPU times: user 19 µs, sys: 0 ns, total: 19 µs
Wall time: 23.6 µs


'No Match!'

Let us see that the improved pattern works even on vastly larger target strings.  A million-character string moves us to milliseconds, but not into millennia.

In [11]:
success2 = 'a'*1_000_000 + 'Z'
failure2 = 'a'*1_000_000 + 'Y'

In [12]:
%%time
re.match(pat3, success2) or "No Match!"

CPU times: user 13.5 ms, sys: 0 ns, total: 13.5 ms
Wall time: 14 ms


<re.Match object; span=(0, 1000001), match='aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa>

In [13]:
%%time
re.match(pat3, failure2) or "No Match!"

CPU times: user 14.5 ms, sys: 10 µs, total: 14.5 ms
Wall time: 14.5 ms


'No Match!'

## Regular Expression Flags

There are six flags you may use to change the meaning of patterns.  You pass them using the keyword argument `flags` of `re.compile()` and of other `re` module functions.  However, with those other functions the `flags` argument only works if a pattern string, rather than a compiled pattern, is passed; a compiled pattern already carries the flags it was compiled with.

Flag      | Short | Description
----------|-------|------------------------------------------------------------------------
ASCII     | A     | `\w`, `\b`, `\s` and `\d` match only on ASCII characters
DOTALL    | S     | `.` matches any character, including newlines
IGNORECASE| I     | Case-insensitive matches
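LOCALE    | L     | Make `\w`, `\b` and case-insensitive matching depend on the current locale (bytes patterns only)
MULTILINE | M     | `^` and `$` match at each line, not only at the ends of the whole string
VERBOSE   | X     | Allow whitespace and `#` comments inside the pattern for readability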
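As a quick illustration of the table, the sketch below compares the same patterns with and without a flag.  The sample strings are invented for this example and are not the `rhyme` text from `src.setup`; LOCALE is left out because it applies only to bytes patterns and depends on the machine's locale.

```python
import re

text = "Mary had\na little lamb"

# DOTALL: '.' also matches the newline between the two lines
no_dotall = re.findall(r'had.a', text)                 # []
with_dotall = re.findall(r'had.a', text, flags=re.S)   # ['had\na']

# MULTILINE: '^' matches at the start of every line
one_line = re.findall(r'^\w+', text)                   # ['Mary']
each_line = re.findall(r'^\w+', text, flags=re.M)      # ['Mary', 'a']

# IGNORECASE
any_case = re.findall(r'mary', text, flags=re.I)       # ['Mary']

# ASCII: \w no longer matches accented letters
accented = 'na\u00efve caf\u00e9'
unicode_words = re.findall(r'\w+', accented)            # two whole words
ascii_words = re.findall(r'\w+', accented, flags=re.A)  # ['na', 've', 'caf']

# VERBOSE: whitespace and comments in the pattern are ignored
phone = re.compile(r'''
    (\d{3})   # exchange
    -         # literal hyphen
    (\d{4})   # line number
''', flags=re.X)
parts = phone.match('555-1234').groups()                # ('555', '1234')
```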

Flags may be combined with bitwise OR (`|`).  Here we match, per line and case-insensitively, the first word of each line that starts with a letter from the latter part of the alphabet:

In [14]:
show(r'^[l-z][a-z]*', rhyme, flags=re.M | re.I)
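The same combination of flags can be given to `re.compile()`.  The sketch below uses a made-up stand-in text rather than the lesson's `rhyme`, and also shows what happens if `flags` is passed together with an already compiled pattern: the `re` module raises a `ValueError`.

```python
import re

# Stand-in text, invented for this example
text = "Mary had a little lamb\nIts fleece was white as snow\nTo school one day"

# Flags are fixed at compile time
first_words = re.compile(r'^[l-z][a-z]*', flags=re.M | re.I)
found = first_words.findall(text)   # ['Mary', 'To']

# The compiled pattern remembers its flags
has_ignorecase = bool(first_words.flags & re.I)   # True

# Passing flags alongside a compiled pattern is rejected
try:
    re.findall(first_words, text, flags=re.I)
except ValueError as err:
    flags_error = str(err)
else:
    flags_error = None
print(flags_error)
```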

An additional, more cryptic means of choosing flags is available.  Flags can be part of the pattern itself, as a prefix that applies either to the whole pattern or to a single group.  The inline flag codes are any of `aiLmsux`; they can also be scoped to an individual group with the `(?flags:...)` syntax, including `a`, `L` and `u`.  The `u` here means to use Unicode (in contrast to ASCII or LOCALE).

In [15]:
# Entire pattern is case-insensitive and multiline
# The first-letter group is interpreted as ASCII
re.findall(r'(?im)(?a:^[l-z])[a-z]*', rhyme)

['Mary', 'went', 'to']
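For comparison, here is a self-contained sketch showing that a global inline prefix behaves like the `flags` keyword, and that a group-scoped flag affects only its own group.  The text is a stand-in invented for this example, not the lesson's `rhyme`.

```python
import re

# Stand-in text, invented for this example
text = "Mary had a little lamb\nIts fleece was white as snow\nTo school one day"

# A global inline prefix is equivalent to passing the flags by keyword
inline = re.findall(r'(?im)^[l-z][a-z]*', text)
keyword = re.findall(r'^[l-z][a-z]*', text, flags=re.I | re.M)

# A scoped flag applies only inside its group: 'mary' is matched
# case-insensitively, while ' had' remains case-sensitive
scoped = re.findall(r'(?i:mary) had', 'MARY had, Mary HAD')

print(inline, keyword, scoped)
# ['Mary', 'To'] ['Mary', 'To'] ['MARY had']
```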