# Groups, Classes, and Alternation

Regular expressions have several capabilities that add more power than the simple wildcards and quantifiers we saw in the last lesson.  The basic elements of these more complex features are creating character classes, grouping subpatterns, and allowing alternation between different subpatterns.

In [1]:
from src.setup import *

## Character Classes

The first lesson actually showed several character classes that have been given shorter aliases because they are very commonly used.  However, you are not limited to only the bundled aliases.

A character class is defined by an expression inside square brackets that contains several characters and/or ranges of characters.  To express ranges, you must know the encoded order of the characters; for ASCII letters and numbers that is obvious, for extended characters it may be less so.  A character class may be expressed as a negation by beginning it with a caret/circumflex `^`.

We can illustrate this by showing what the equivalent character classes are for the predefined aliases.  Shown here are only the ASCII ranges, which are not complete for non-Latin, or extended-Latin, characters.

 Wildcard | Class            | Behavior
:--------:|------------------|------------------
   \d     | `[0-9]`          | Any decimal digit
   \D     | `[^0-9]`         | Any non-digit character
   \s     | `[ \t\n\r\f\v]`  | Any whitespace character
   \S     | `[^ \t\n\r\f\v]` | Any non-whitespace character
   \w     | `[a-zA-Z0-9_]`   | Any alphanumeric character or underscore
   \W     | `[^a-zA-Z0-9_]`  | Any character other than alphanumerics and underscore
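
As a quick check of the table, the sketch below compares each alias with its explicit class under `re.ASCII`.  The `sample` string is made up for illustration and is not part of the lesson's data.

```python
import re

# Hypothetical sample text, chosen only to exercise each class
sample = "Mary had 1 little_lamb,\tits fleece: 100% white\n"

# Each alias paired with the ASCII-only class from the table above
pairs = {
    r'\d': r'[0-9]',
    r'\D': r'[^0-9]',
    r'\s': r'[ \t\n\r\f\v]',
    r'\S': r'[^ \t\n\r\f\v]',
    r'\w': r'[a-zA-Z0-9_]',
    r'\W': r'[^a-zA-Z0-9_]',
}

for alias, cls in pairs.items():
    # re.ASCII restricts the alias to its ASCII-only meaning
    same = re.findall(alias, sample, re.ASCII) == re.findall(cls, sample)
    print(alias, cls, same)
```

Each line should report `True` for this sample.  Without `re.ASCII`, the aliases match additional Unicode characters that the explicit classes do not.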

Continuing with our nursery rhyme example, we might want—for whatever reason—to identify all of those substrings that only use letters from the first half of the English alphabet.

In [2]:
# first half of alphabet
show(r'[A-Ma-m]+', rhyme)

To look at only those substrings using letters from the second half of the English alphabet, we could take either of two approaches.

In [3]:
# second half of alphabet (exclude spaces, comma, etc)
show(r'[^A-Ma-m ,\n]+', rhyme)

In [4]:
# second half of alphabet (alternative)
show(r'[N-Zn-z]+', rhyme)
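
The two approaches agree only when the text contains nothing beyond what the negated class excludes.  A minimal sketch, using a made-up string (an assumption, not the lesson's `rhyme`) that contains an apostrophe, a digit, and an exclamation mark:

```python
import re

# Hypothetical text with characters the negated class does not exclude
text = "Mary's 2 lambs won!"

# Negated class: anything that is NOT A-M, a-m, space, comma, or newline
negated = re.findall(r'[^A-Ma-m ,\n]+', text)

# Explicit class: only letters N-Z and n-z
explicit = re.findall(r'[N-Zn-z]+', text)

print(negated)   # ["ry's", '2', 's', 'won!']
print(explicit)  # ['ry', 's', 's', 'won']
```

The negated class lets punctuation and digits through, so the explicit class is usually the safer choice when only letters are wanted.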

The ranges can occur in any order within the character class; for example, `[a-mA-M]` matches exactly what `[A-Ma-m]` does.  We happened to list capital letters first, which is also where they fall in ASCII and Unicode: upper-case letters have lower code points than lower-case ones.  We can also define arbitrary collections of characters that do not include ranges, or that mix ranges with individual characters.

In [5]:
# arbitrary character class
show(r'[aeioubdlth]+', rhyme)

In [6]:
# Any cap, any vowel, second half of lower case
show(r'[A-Zaeioun-z]+', rhyme)
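
The order of ranges within a class is free, but the two endpoints of a single range must be in code point order.  The sketch below illustrates both points on a hypothetical one-line sample.

```python
import re

# Hypothetical sample, not the lesson's full rhyme
text = "Mary had a little lamb"

# Listing the ranges in either order gives the same matches
upper_first = re.findall(r'[A-Ma-m]+', text)
lower_first = re.findall(r'[a-mA-M]+', text)
print(upper_first == lower_first)  # True

# Code points of the range endpoints
print(ord('A'), ord('Z'), ord('a'), ord('z'))  # 65 90 97 122

# A range whose endpoints are reversed is rejected at compile time
try:
    re.compile(r'[z-a]')
    reversed_ok = True
except re.error as err:
    reversed_ok = False
    print("invalid range:", err)
```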

## Other Alphabets

As mentioned, the character order of an unfamiliar alphabet may not be obvious to you.  The predefined aliases sidestep that problem: unless the ASCII-only flag is used (`re.ASCII` or `re.A`), `\w` matches all letter-like characters in Unicode.  Below is a rough Russian translation of "Mary had a little lamb."

In [7]:
print(рифма)

[*] у Мэри был маленький ягненок!


In [8]:
# Match sequences of letter-like characters, even in Cyrillic
show(r'\w+', рифма)
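
To see the effect of the ASCII-only flag, here is a small sketch using plain `re.findall`.  The `line` variable is an assumption: a single line resembling the printed output above, not the lesson's `рифма` variable.

```python
import re

# One line of Cyrillic text, resembling the output printed above
line = "у Мэри был маленький ягненок!"

# Default for str patterns: \w is Unicode-aware
unicode_words = re.findall(r'\w+', line)

# With re.ASCII, \w means [a-zA-Z0-9_] only, so nothing here matches
ascii_words = re.findall(r'\w+', line, re.ASCII)

print(unicode_words)  # ['у', 'Мэри', 'был', 'маленький', 'ягненок']
print(ascii_words)    # []
```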

We might perform the same match using character ranges, but we will need to know the Unicode order of the Cyrillic letters.  The same concern, obviously, applies to any alphabet.

In [9]:
show(r'[А-Яа-я]+', рифма)

The order is more obvious if we use the Unicode code points.  In the Cyrillic block, the *Russian* upper-case letters (U+0410 to U+042F) occur immediately before the Russian lower-case letters (U+0430 to U+044F), unlike ASCII Latin letters, where several punctuation characters sit between `Z` and `a`.  However, languages other than Russian, as well as some uncommon or archaic Russian texts, use some Unicode code points both before and after the range I chose.  Even the modern Russian letter ё (U+0451) falls just outside it.

As code points, we can replicate the last example as follows.  Here we use an ordinary (non-raw) string so that Python itself converts the `\uXXXX` escapes into characters before the pattern is compiled.  Since Python 3.3 the `re` module also interprets `\uXXXX` escapes on its own, so a raw string should work as well.

In [10]:
show('[\u0410-\u044f]+', рифма)

More generally, if we wanted to match the entire Cyrillic block we would use a more expansive range.  Whether this is good or bad depends on the purpose (e.g. maybe we want to match Russian but exclude the additional letters in Abkhaz).

In [11]:
show('[\u0400-\u04ff]+', рифма)
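
The difference between the two ranges can be checked directly.  In this sketch the word `ёлка` is an illustrative choice, not taken from the lesson text; its first letter is written as an escape so the code point is unambiguous.

```python
import re

# Endpoints of the narrower range, shown as code points
endpoints = {ch: f'U+{ord(ch):04X}' for ch in 'АЯая'}
print(endpoints)  # {'А': 'U+0410', 'Я': 'U+042F', 'а': 'U+0430', 'я': 'U+044F'}

# Illustrative word beginning with ё (U+0451), which lies outside А-я
word = '\u0451лка'

narrow = re.findall('[\u0410-\u044f]+', word)  # А through я only
broad = re.findall('[\u0400-\u04ff]+', word)   # the whole Cyrillic block

print(narrow)  # ['лка']
print(broad)   # ['ёлка']
```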

As in the English examples, we could look for more limited character ranges.

In [12]:
show(r'[А-Жа-п]+', рифма)

## Grouping

Any subpattern in a regular expression can be grouped with parentheses so that the group as a whole can be subject to quantification, as well as to the alternation that we will look at below.

Let us look for a particular pattern.  We want substrings that start with a consonant and consist of one or more repetitions of a consonant cluster followed by exactly one vowel (counting `y` as a vowel here).  The match is terminated by a word boundary.

In [13]:
pat = r'([Mbcdfghjklmnpqrstvwxz]+[aeiouy])+\b'
show(pat, rhyme)
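
The sketch below shows what the parentheses change, using plain `re.search` on made-up strings (assumptions, not the lesson's `rhyme`).  It also shows that a quantified group retains only its last repetition as the captured text.

```python
import re

# Without a group, the quantifier applies only to the preceding character
ungrouped = re.search(r'ha+', 'hahaha').group(0)
# With a group, the whole subpattern is repeated
grouped = re.search(r'(ha)+', 'hahaha').group(0)
print(ungrouped)  # ha
print(grouped)    # hahaha

# The lesson's pattern applied to a hypothetical one-line sample
pat = r'([Mbcdfghjklmnpqrstvwxz]+[aeiouy])+\b'
m = re.search(pat, 'Mary had a little lamb')
print(m.group(0))  # Mary  (two repetitions: 'Ma' then 'ry')
print(m.group(1))  # ry    (only the last repetition is kept)
```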

## Alternation

Along with repetition of patterns, you may also specify alternation among patterns.  In the simplest case, this is just a choice among literals.  It is sometimes required, and usually helpful, to put parentheses around the alternation groups.

In [14]:
show(r'(fleece)|(lamb)|(white)', rhyme)
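
Parentheses become required when the alternation is only part of a larger pattern, because `|` otherwise splits the entire expression.  A minimal sketch with a hypothetical word list:

```python
import re

# Hypothetical candidates, chosen to expose the difference
words = ['lamb', 'limb', 'la', 'imb']

# Unparenthesized: means "la" OR "imb"
loose = [w for w in words if re.fullmatch(r'la|imb', w)]

# Parenthesized: "l", then "a" or "i", then "mb"
tight = [w for w in words if re.fullmatch(r'l(a|i)mb', w)]

print(loose)  # ['la', 'imb']
print(tight)  # ['lamb', 'limb']
```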

Often it is useful to use subpatterns within the alternation.  For example, let us look for words that *either*:

* Begin with a vowel and are one or two letters long;
* Begin with a consonant and are 5 or more letters.

In [15]:
# Some letters omitted from classes to shorten
# The initial letter counts toward the length, so \w? allows 1-2
# letters in total and \w{4,} allows 5 or more
show(r'\b([AEIOUaeiou]\w?|[bcdfghlmMnrstvw]\w{4,})\b', rhyme)

We can combine these techniques further.  For example, let us identify all sequences of 3 to 7 words matching our somewhat odd criteria.  That is, we quantify an alternation group.

In [16]:
pat = r'(\b([AEIOUaeiou]\w?|[bcdfghlmMnrstvw]\w{4,})\s+){3,7}'
show(pat, rhyme)

## Verbose Regular Expressions

The last example we looked at is already starting to get pretty dense to read.  Regular expressions can become very complicated to capture complex patterns.  Patterns may be specified in a "verbose" mode if an appropriate flag is used for calls to the regex functions.

In a verbose pattern (enabled with `re.VERBOSE`, or `re.X`), whitespace is ignored except when it occurs within a character class or is escaped with a backslash.  Moreover, a `#` outside a character class begins a comment that runs to the end of the line, which allows annotations of the subpatterns.  We can combine several elements of the syntax we have learned to describe URLs in verbose style.  This example is absolutely not robust against everything URLs can contain, but it illustrates the verbose syntax.

In [17]:
pat = r'''   # identify URLs within a multiline string
(https?|ftp) # make sure we find a resource type
         :// # needs to be followed by colon-slash-slash
  [^ ,\t\n]+ # stuff other than comma, space, tab, newline
'''

In [18]:
s = '''The URL for my site is: http://example.com/mydoc.html.  You
might also enjoy ftp://example.org/index.html for a good place
to download files. A URL might end its line:

https://example.net/secure
'''

show(pat, s, re.VERBOSE)
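
For use outside the lesson's `show()` helper, the same idea can be sketched with the standard `re` functions.  The names `url_pat` and `text` are assumptions introduced for illustration.  Note that because the scheme is a capturing group, `findall` returns only the scheme, while `finditer` gives access to each whole match.

```python
import re

# Compile once with re.VERBOSE so layout and comments are ignored
url_pat = re.compile(r'''
    (https?|ftp)   # resource type
    ://            # colon-slash-slash
    [^ ,\t\n]+     # the space inside the class is still literal
''', re.VERBOSE)

# Hypothetical text, not the lesson's sample
text = "See http://example.com/a.html, or ftp://example.org/index.html today"

whole = [m.group(0) for m in url_pat.finditer(text)]
schemes = url_pat.findall(text)

print(whole)    # ['http://example.com/a.html', 'ftp://example.org/index.html']
print(schemes)  # ['http', 'ftp']

# The compact, non-verbose spelling matches the same substrings
compact = re.compile(r'(https?|ftp)://[^ ,\t\n]+')
print(whole == [m.group(0) for m in compact.finditer(text)])  # True
```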