In [1]:
%%HTML
<style>
.container { width: 100% }
</style>

# Regular Expressions in Python (A Short Tutorial)

This is a tutorial showing how regular expressions are supported in *Python*.  
The assumption is that the reader already has a grasp of the concept of 
[regular expressions](https://en.wikipedia.org/wiki/Regular_expression) as it is taught in lectures 
on formal languages, for example in 
[Formal Languages and Their Application](https://github.com/karlstroetmann/Formal-Languages/blob/master/Lecture-Notes/formal-languages.pdf), but does not know how regular expressions are supported in *Python*.

In *Python*, regular expressions are not part of the core language but are rather supported by the module `re`.  This module is part of the *Python* standard library and therefore there is no need 
to install this module.  The full documentation of this module can be found at
[https://docs.python.org/3/library/re.html](https://docs.python.org/3/library/re.html).

In [2]:
import re

Regular expressions are strings that describe <em style="color:blue">languages</em>, where a 
<em style="color:blue">language</em> is defined as a <em style="color:blue">set of strings</em>. 
In the following, let us assume that $\Sigma$ is the set of all Unicode characters and $\Sigma^*$ is the set 
of strings consisting of Unicode characters.  We will define the set $\textrm{RegExp}$ of regular expressions inductively.
In order to define the meaning of a regular expression $r$ we define a function 
$$\mathcal{L}:\textrm{RegExp} \rightarrow 2^{\Sigma^*} $$
such that $\mathcal{L}(r)$ is the <em style="color:blue">language</em> specified by the regular expression $r$.

In order to demonstrate how regular expressions work we will use the function `findall` from the module 
`re`.  This function is called in the following way:
$$ \texttt{re.findall}(r, s, \textrm{flags}=0) $$
Here, the arguments are interpreted as follows:
- $r$ is a regular expression,
- $s$ is a string, and
- $\textrm{flags}$ is an optional argument of type `int` which is set to $0$ by default.
  This argument is useful to set flags that might be used to alter the interpretation of the regular 
  expression $r$. 
  For example, if the flag `re.IGNORECASE` is set, then the search performed by `findall` is not  
  case sensitive.
  
The function `findall` will return a list of those non-overlapping substrings of the string $s$ that 
match the regular expression $r$.  In the following example, the regular expression $r$ searches
for the letter `a` and since the string $s$ contains the character `a` two times, `findall` returns a 
list with two occurrences of `a`:

In [3]:
re.findall('a', 'abcabcABC')

['a', 'a']

In [4]:
re.findall('a', 'abcabcABC', re.IGNORECASE)

['a', 'a', 'A']
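
The term *non-overlapping* can be illustrated with a small experiment.  The following is a minimal sketch (the strings are chosen only for illustration): `findall` scans the string from left to right and, once a match has been found, resumes the search behind the end of that match.

```python
import re

# After 'aa' has been matched at index 0, the search resumes at index 2.
# Therefore the string 'aaaa' yields two matches ...
two_pairs = re.findall('aa', 'aaaa')

# ... while 'aaa' yields only one: the 'aa' starting at index 1 would
# overlap the first match and is therefore not reported.
one_pair = re.findall('aa', 'aaa')

print(two_pairs, one_pair)
```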

To begin our definition of the set $\textrm{RegExp}$ of regular expressions, we first have to define
the set $\textrm{MetaChars}$ of all <em style="color:blue">meta-characters</em>:
```
    MetaChars := { '.', '^', '$', '*', '+', '?', '{', '}', '[', ']', '\', '|', '(', ')' }
```
These characters are used as operator symbols or part of operator symbols inside of regular
expressions.

Now we can start our inductive definition of regular expressions:
- Any Unicode character $c$ such that $c \not\in \textrm{MetaChars}$ is a regular expression.
  The regular expression $c$ matches the character $c$, i.e. we have
  $$ \mathcal{L}(c) = \{ c \}. $$
- If $c$ is a meta character, i.e. we have $c \in \textrm{MetaChars}$, then the string $\backslash c$
  is a regular expression matching the meta-character $c$, i.e. we have
  $$ \mathcal{L}(\backslash c) = \{ c \}. $$

In [5]:
re.findall('a', 'abaa')

['a', 'a', 'a']

In the following example we have to use <em style="color:blue">raw strings</em> in order to prevent
*Python* from interpreting the backslash character as the start of an <em style="color:blue">escape sequence</em>.  A string is a 
<em style="color:blue">raw string</em> if the opening quote character is preceded by the character
`r`.

In [6]:
re.findall(r'\+', '+-+')

['+', '+']
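
The effect of the prefix `r` can be checked directly.  The sketch below compares an ordinary string literal with a raw string and additionally uses the function `re.escape` from the module `re`, which puts a backslash in front of the meta-characters of a given string.  The variable names and sample strings are illustrative assumptions.

```python
import re

# In an ordinary string literal '\n' is a single newline character,
# whereas the raw string r'\n' consists of a backslash followed by 'n'.
plain_len = len('\n')
raw_len = len(r'\n')
print(plain_len, raw_len)

# Without the prefix r, the backslash itself has to be doubled.
same_pattern = ('\\+' == r'\+')

# re.escape builds a pattern that matches the given string literally.
literal = 'a+b'
hits = re.findall(re.escape(literal), 'a+b aab a+b')
print(same_pattern, hits)
```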

## Concatenation

The next rule shows how regular expressions can be <em style="color:blue">concatenated</em>:
- If $r_1$ and $r_2$ are regular expressions, then $r_1r_2$ is a regular expression.  This
  regular expression matches any string $s$ that can be split into two substrings $s_1$ and $s_2$ 
  such that $r_1$ matches $s_1$ and $r_2$ matches $s_2$.  Formally, we have
  $$\mathcal{L}(r_1r_2) := 
    \bigl\{ s_1s_2 \mid s_1 \in \mathcal{L}(r_1) \wedge s_2 \in \mathcal{L}(r_2) \bigr\}.
  $$
  This way, we can now find words with regular expressions.

In [7]:
re.findall(r'the', 'The horse, the dog, and the cat.', flags=re.IGNORECASE)

['The', 'the', 'the']

## Choice

Regular expressions provide the operator `|` that can be used to choose between 
<em style="color:blue">alternatives:</em>
- If $r_1$ and $r_2$ are regular expressions, then $r_1|r_2$ is a regular expression.  This
  regular expression matches any string $s$ that is matched by either $r_1$ or $r_2$.
  Formally, we have
  $$\mathcal{L}(r_1|r_2) := \mathcal{L}(r_1) \cup \mathcal{L}(r_2).  $$

In [8]:
re.findall(r'The|the', 'The horse, the dog, and the cat.')

['The', 'the', 'the']

## Quantifiers

The most interesting regular expression operators are the <em style="color:blue">quantifiers</em>.
The official documentation calls them <em style="color:blue">repetition qualifiers</em> but in this notebook 
they are called *quantifiers*, since this is shorter.  Syntactically, *quantifiers* are 
<em style="color:blue">postfix operators</em>.
- If $r$ is a regular expression, then $r+$ is a regular expression.  This
  regular expression matches any string $s$ that can be split into a list of $n \geq 1$ substrings $s_1$, 
  $s_2$, $\cdots$, $s_n$ such that $r$ matches $s_i$ for all $i \in \{1,\cdots,n\}$.  
  Formally, we have
  $$\mathcal{L}(r+) := 
    \Bigl\{ s \Bigm| \exists n \in \mathbb{N}: \bigl(n \geq 1 \wedge 
            \exists s_1,\cdots,s_n : (s_1 \cdots s_n = s \wedge 
             \forall i \in \{1,\cdots, n\}:  s_i \in \mathcal{L}(r))\bigr)  
    \Bigr\}.
  $$
  Informally, $r+$ matches $r$ any positive number of times.

In [9]:
re.findall(r'a+', 'abaabaaaba.')

['a', 'aa', 'aaa', 'a']

- If $r$ is a regular expression, then $r*$ is a regular expression.  This
  regular expression matches either the empty string or any string $s$ that can be split into a list of $n \geq 1$ substrings $s_1$, 
  $s_2$, $\cdots$, $s_n$ such that $r$ matches $s_i$ for all $i \in \{1,\cdots,n\}$.  
  Formally, we have
  $$\mathcal{L}(r*) := \bigl\{ \texttt{''} \bigr\} \cup
    \Bigl\{ s \Bigm| \exists n \in \mathbb{N}: \bigl(n \geq 1 \wedge 
            \exists s_1,\cdots,s_n : (s_1 \cdots s_n = s \wedge 
             \forall i \in \{1,\cdots, n\}:  s_i \in \mathcal{L}(r))\bigr)  
    \Bigr\}.
  $$
  
  Informally, $r*$ matches $r$ any number of times, including zero times.  Therefore, in the following example the result also contains several empty strings.  For example, in the string `'abaabaaaba'` the regular expression `a*` finds an empty string at the position of each occurrence of the character `'b'`.  The final occurrence of the empty string is found at the end of the string:

In [10]:
re.findall(r'a*', 'abaabaaaba')

['a', '', 'aa', '', 'aaa', '', 'a', '']

- If $r$ is a regular expression, then $r?$ is a regular expression.  This
  regular expression matches either the empty string or any string $s$ that is matched by $r$.  Formally, we have
  $$\mathcal{L}(r?) := \bigl\{ \texttt{''} \bigr\} \cup \mathcal{L}(r). $$
  Informally, $r?$ matches $r$ at most once, i.e. either once or not at all.  Therefore, in the    
  following example the result also contains two empty strings.  One of these is found at the position of the 
  character `'b'`, the second is found at the end of the string.

In [11]:
re.findall(r'a?', 'abaa')

['a', '', 'a', 'a', '']

- If $r$ is a regular expression and $m,n\in\mathbb{N}$ such that $m \leq n$, then $r\{m,n\}$ is a 
  regular expression.  This regular expression matches any number $k$ of repetitions of $r$ such that $m \leq k \leq n$.
  Formally, we have
  $$\mathcal{L}(r\{m,n\}) :=
    \Bigl\{ s \Bigm| \exists k \in \mathbb{N}: \bigl(m \leq k \leq n \wedge 
            \exists s_1,\cdots,s_k : (s_1 \cdots s_k = s \wedge 
             \forall i \in \{1,\cdots, k\}:  s_i \in \mathcal{L}(r))\bigr)  
    \Bigr\}.
  $$
  Informally, $r\{m,n\}$ matches $r$ between $m$ and $n$ times.

In [12]:
re.findall(r'a{2,3}', 'aaaa')

['aaa']

- If $r$ is a regular expression and $n\in\mathbb{N}$, then $r\{n\}$ is a regular expression.  This regular expression matches exactly $n$ repetitions of $r$.  Formally, we have
  $$\mathcal{L}(r\{n\}) = \mathcal{L}(r\{n,n\}).$$

In [13]:
re.findall(r'a{2}', 'aabaaaba')

['aa', 'aa']

- If $r$ is a regular expression and $n\in\mathbb{N}$, then $r\{,n\}$ is a regular expression.  This regular expression matches up to $n$ repetitions of $r$.  Formally, we have
  $$\mathcal{L}(r\{,n\}) = \mathcal{L}(r\{0,n\}).$$

In [14]:
re.findall(r'a{,2}', 'aabaaaba')

['aa', '', 'aa', 'a', '', 'a', '']

- If $r$ is a regular expression and $n\in\mathbb{N}$, then $r\{n,\}$ is a regular expression.  This regular expression matches $n$ or more repetitions of $r$.  Formally, we have
  $$\mathcal{L}(r\{n,\}) = \mathcal{L}(r\{n\}r*).$$

In [15]:
re.findall(r'a{2,}', 'aabaaaba')

['aa', 'aaa']
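
Several of the quantifiers can be expressed in terms of the others.  The following sketch compares the results of `findall` for pairs of regular expressions that should describe the same language.  Agreement on a single sample string is of course only a sanity check and not a proof; the sample string is an arbitrary choice.

```python
import re

sample = 'aabaaaba'  # illustrative input only

# r+ should behave like r followed by r*.
plus_ok = re.findall(r'a+', sample) == re.findall(r'aa*', sample)

# r{n,} should behave like r{n} followed by r*.
at_least_ok = re.findall(r'a{2,}', sample) == re.findall(r'a{2}a*', sample)

# r? should behave like r{0,1}, and r{,n} like r{0,n}.
optional_ok = re.findall(r'a?', sample) == re.findall(r'a{0,1}', sample)
up_to_ok = re.findall(r'a{,2}', sample) == re.findall(r'a{0,2}', sample)

print(plus_ok, at_least_ok, optional_ok, up_to_ok)
```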

## Non-Greedy Quantifiers

The quantifiers `?`, `+`, `*`, `{m,n}`, `{n}`, `{,n}`, and `{n,}` are <em style="color:blue">greedy</em>, i.e. they 
match the longest possible substrings.  Suffixing these operators with the character `?` makes them 
<em style="color:blue">non-greedy</em>.  For example, the regular expression `a{2,3}?` matches either 
two occurrences of the character `a` or three occurrences but will prefer to match only two characters.  Hence, the regular expression `a{2,3}?` will find two matches in the string `'aaaa'`, while the regular expression 
`a{2,3}` will find only one match. 

In [16]:
re.findall(r'a{2,3}?', 'aaaa'), re.findall(r'a{2,3}', 'aaaa')

(['aa', 'aa'], ['aaa'])
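
The same effect can be observed with the quantifier `+`.  The sketch below also illustrates a point that is easy to overlook: a non-greedy quantifier still consumes more characters if this is the only way for the rest of the regular expression to match.  The sample strings are illustrative.

```python
import re

# Greedy: one match that is as long as possible.
greedy = re.findall(r'a+', 'aaa')

# Non-greedy: every match is as short as possible, i.e. a single 'a'.
lazy = re.findall(r'a+?', 'aaa')

# Here 'a+?' is forced to consume all three 'a's, because otherwise
# the following 'b' could not be matched.
forced = re.findall(r'a+?b', 'aaab')

print(greedy, lazy, forced)
```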

## Character Classes

In order to match a set of characters we can use a <em style="color:blue">character class</em>.
If $c_1$, $\cdots$, $c_n$ are Unicode characters, then $[c_1\cdots c_n]$ is a regular expression that 
matches any of the characters from the set $\{c_1,\cdots,c_n\}$:
$$ \mathcal{L}\bigl([c_1\cdots c_n]\bigr) := \{ c_1, \cdots, c_n \} $$

In [17]:
re.findall(r'[abc]+', 'abcdcba')

['abc', 'cba']

Character classes can also contain <em style="color:blue">ranges</em>.  Syntactically, a range has the form
$c_1\texttt{-}c_2$, where $c_1$ and $c_2$ are Unicode characters.
For example, the regular expression `[0-9]` contains the range `0-9` and matches any decimal digit.  To find all numbers embedded in a string we could use the regular expression `[1-9][0-9]*|[0-9]`.  This regular expression matches either a single digit or a string that starts with a non-zero digit and is followed by any number of digits.

In [18]:
re.findall(r'[1-9][0-9]*|[0-9]', '11 abc 12 2345 007 42 0')

['11', '12', '2345', '0', '0', '7', '42', '0']

Note that the next example looks quite similar but gives a different result:

In [19]:
re.findall(r'[0-9]|[1-9][0-9]*', '11 abc 12 2345 007 42 0')

['1', '1', '1', '2', '2', '3', '4', '5', '0', '0', '7', '4', '2', '0']

Here, the regular expression starts with the alternative `[0-9]`, which matches any single digit. 
So once a digit is found, the resulting substring is returned and the search starts again.  Therefore, if this regular expression is used in `findall`, it will only return a list of single digits.
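
Both regular expressions describe the same language; they only differ in the order in which the alternatives are tried during a search.  One way to check this, sketched below, is the function `re.fullmatch`, which succeeds only if the *entire* string is matched.  The list of candidate strings is an arbitrary choice made for illustration.

```python
import re

first = r'[1-9][0-9]*|[0-9]'
second = r'[0-9]|[1-9][0-9]*'

# Illustrative candidates: some are numbers without leading zeros, some are not.
candidates = ['0', '7', '42', '2345', '007', '']

# With fullmatch, the order of the alternatives no longer matters, because
# the engine falls back to the other alternative if the first one does not
# cover the whole string.
accepted_first = [s for s in candidates if re.fullmatch(first, s)]
accepted_second = [s for s in candidates if re.fullmatch(second, s)]

print(accepted_first)
print(accepted_second)
```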

There are some predefined character classes:
- `\d` matches any digit.
- `\D` matches any non-digit character.
- `\s` matches any whitespace character.
- `\S` matches any non-whitespace character.
- `\w` matches any alphanumeric character or the underscore `_`.
  If we restricted ourselves to <font style="font-variant: small-caps">Ascii</font> characters, this would 
  be equivalent to the character class `[0-9a-zA-Z_]`.
- `\W` matches any character that is not matched by `\w`.
- `\b` matches at a word boundary.  The string that is matched is the empty string.
- `\B` matches at any place that is **not** a word boundary.  
  Again, the string that is matched is the empty string.

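The escape sequences `\d`, `\D`, `\s`, `\S`, `\w`, and `\W` can also be used inside of square brackets.  This does not apply to `\b` and `\B`: inside of square brackets, `\b` denotes the backspace character.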

In [20]:
re.findall(r'[\dabc]+', '11 abc12 1a2 2b3c4d5')

['11', 'abc12', '1a2', '2b3c4', '5']
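
The remark above that `\w` corresponds to `[0-9a-zA-Z_]` for ASCII characters can be tried out with the flag `re.ASCII`, which restricts the predefined character classes to ASCII characters.  The following is a small sketch; the sample string is made up, and the accented letters are written as Unicode escapes so that the example does not depend on the encoding of the source file.

```python
import re

# 'naïve café_1' with the accented letters written as escapes.
sample = 'na\u00efve caf\u00e9_1'

# By default, \w also matches non-ASCII letters such as 'ï' and 'é'.
unicode_words = re.findall(r'\w+', sample)

# With re.ASCII, \w is restricted to [0-9a-zA-Z_], so the accented
# letters split the words.
ascii_words = re.findall(r'\w+', sample, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```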

Character classes can be negated if the first character after the opening `[` is the character `^`.
For example, `[^abc]` matches any character that is different from `a`, `b`, or `c`.

In [21]:
re.findall(r'[^abc]+', 'axyzbuvwchij')

['xyz', 'uvw', 'hij']

In [22]:
len(re.findall(r'\b\w+\b', 'This is some text where we want to count the words.'))

11

The following regular expression uses the word-boundary assertion `\b` to isolate numbers.  Note that we have to use parentheses since concatenation of regular expressions binds more strongly than the choice operator `|`.

In [23]:
re.findall(r'\b([0-9]|[1-9][0-9]*)\b', '11 abc 12 2345 007 42 0')

['11', '12', '2345', '42', '0']

## Negated Character Classes

If the first character of a character class is the character `^`, then the character class is 
<em style="color:blue">negated</em> and matches all characters that are **not** listed in the character class.
For example, the regular expression `[^a-zA-Z\s]+` matches any non-empty string of characters that contains
neither a letter nor a whitespace character.

In [24]:
re.findall(r'[^a-zA-Z\s]+', 'One horse, 3 guns, and 2 cowboys.')

[',', '3', ',', '2', '.']

## Grouping

If $r$ is a regular expression, then $(r)$ is a regular expression describing the same language as 
$r$.  There are two reasons for using parentheses:
- Parentheses can be used to override the precedence of an operator.
  This concept is the same as in programming languages.  For example, the regular expression `ab+`
  matches the character `a` followed by any positive number of occurrences of the character `b` because
  the precedence of a quantifier is higher than the precedence of concatenation of regular expressions. 
  However, `(ab)+` matches the strings `ab`, `abab`, `ababab`, and so on.
- Parentheses can be used for <em style="color:blue">back-references</em> because inside 
  a regular expression we can refer to the substring matched by a regular expression enclosed in a pair of
  parentheses using the syntax $\backslash n$ where $n \in \{1,\cdots,99\}$.
  Here, $\backslash n$ refers to the $n$th parenthesized <em style="color:blue">group</em> in the regular 
  expression, where a group is defined as any part of the regular expression enclosed in parentheses.
  Groups are numbered by the position of their opening parenthesis, counted from left to right.  For example, the regular expression
  ```
  (a(b|c)*d)?ef(gh)+
  ```
  has three groups:
  1. `(a(b|c)*d)` is the first group,
  2. `(b|c)` is the second group, and
  3. `(gh)` is the third group.
  
  For example, if we want to recognize a string that starts with a number followed by some white space and then
  followed by the <b>same</b> number we can use the regular expression `(\d+)\s+\1`.

In [25]:
re.findall(r'(\d+)\s+\1', '12 12 23 23 17 18')

['12', '23']

In general, the expression $\backslash n$ refers to the string matched in the $n$-th group of the regular expression that it occurs in.
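
Note that the result of the previous cell contains only the numbers and not the complete matches: if a regular expression contains a group, `findall` returns the strings matched by the group instead of the whole match.  The sketch below contrasts this with `re.finditer`, which yields match objects; calling `group(0)` on a match object returns the entire matched substring.

```python
import re

pattern = r'(\d+)\s+\1'
s = '12 12 23 23 17 18'

# With one group in the pattern, findall returns the group contents.
groups_only = re.findall(pattern, s)

# finditer yields match objects; group(0) is the complete match and
# group(1) is the string matched by the first group.
whole_matches = [m.group(0) for m in re.finditer(pattern, s)]

print(groups_only)
print(whole_matches)
```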

## The Dot

The regular expression `.` matches any character except the newline.  For example, `c.*?t` matches any string that starts with the character `c` and ends with the character `t` and does not contain any newline.  If we are using the non-greedy version of the quantifier `*`, we can find all such words in the string below.

In [26]:
re.findall(r'c.*?t', 'ct cat caat cut')

['ct', 'cat', 'caat', 'cut']

The dot `.` does not have any special meaning when used inside a character class.  Hence, the regular expression
`[.]` matches only the character `.`.
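
Both statements can be checked with a short experiment.  The sketch below additionally uses the flag `re.DOTALL` from the module `re`, which makes the dot match the newline character as well.  The sample strings are illustrative.

```python
import re

two_lines = 'a\nb a-b'

# By default the dot does not match the newline, so only 'a-b' is found.
without_flag = re.findall(r'a.b', two_lines)

# With re.DOTALL the dot matches every character, including the newline.
with_flag = re.findall(r'a.b', two_lines, flags=re.DOTALL)

# Inside square brackets the dot is an ordinary character.
only_dots = re.findall(r'[.]', 'a.b.c')

print(without_flag, with_flag, only_dots)
```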

## Start and End of a Line

The regular expression `^` matches at the start of a string.  If we set the flag `re.MULTILINE`, which we 
will usually do when working with this regular expression, then `^` also matches at the beginning of each line,
i.e. it matches after every newline character.

Similarly, the regular expression `$` matches at the end of a string.  If we set the flag `re.MULTILINE`, which we 
will usually do when working with this regular expression, then `$` also matches at the end of each line,
i.e. it matches before every newline character.

In [27]:
data = \
'''
This is a text containing five lines, two of which are empty.
This is the second line,
and this is the third line.
'''
re.findall(r'^.*$', data, flags=re.MULTILINE)

['',
 'This is a text containing five lines, two of which are empty.',
 'This is the second line,',
 'and this is the third line.',
 '']
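
To see what the flag `re.MULTILINE` changes, the following sketch runs the same regular expressions with and without the flag.  The three-line sample string is made up for illustration.

```python
import re

lines = 'one\ntwo\nthree'

# Without the flag, ^ matches only at the start of the string and
# $ only at the end of the string.
starts_default = re.findall(r'^\w+', lines)
ends_default = re.findall(r'\w+$', lines)

# With re.MULTILINE, ^ and $ match at the start and end of every line.
starts_multiline = re.findall(r'^\w+', lines, flags=re.MULTILINE)
ends_multiline = re.findall(r'\w+$', lines, flags=re.MULTILINE)

print(starts_default, ends_default)
print(starts_multiline, ends_multiline)
```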

## Examples

In order to have some strings to play with, let us read the file `alice.txt`, which contains the book
[Alice's Adventures in Wonderland](https://en.wikipedia.org/wiki/Alice%27s_Adventures_in_Wonderland) written by 
[Lewis Carroll](https://en.wikipedia.org/wiki/Lewis_Carroll).

In [28]:
with open('alice.txt', 'r') as f:
    text = f.read()

In [29]:
print(text[:1020])


                ALICE'S ADVENTURES IN WONDERLAND

                          Lewis Carroll

               THE MILLENNIUM FULCRUM EDITION 3.0




                            CHAPTER I

                      Down the Rabbit-Hole


  Alice was beginning to get very tired of sitting by her sister
on the bank, and of having nothing to do:  once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
thought Alice `without pictures or conversation?'

  So she was considering in her own mind (as well as she could,
for the hot day made her feel very sleepy and stupid), whether
the pleasure of making a daisy-chain would be worth the trouble
of getting up and picking the daisies, when suddenly a White
Rabbit with pink eyes ran close by her.

  There was nothing so VERY remarkable in that; nor did Alice
think it so VERY much out of the way to hear the Rabbit say to
itself, `Oh dear!  Oh dear!  I shall be late!'

How many non-empty lines does this story have?

In [30]:
len(re.findall(r'^.*[^\s].*?$', text, flags=re.MULTILINE))

2725
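
A result like this can be cross-checked without regular expressions.  The sketch below does so on a small made-up sample, since it cannot assume that the file `alice.txt` is available; it counts the lines that still contain a character after surrounding whitespace has been stripped.  Note that `str.splitlines` recognizes a few more line separators than just the newline, so the two counts could differ on texts containing such characters.

```python
import re

# Four lines: two with text, one empty, one containing only whitespace.
sample = 'first line\n\n   \t\nlast line\n'

# The regular expression from the cell above.
by_regex = len(re.findall(r'^.*[^\s].*?$', sample, flags=re.MULTILINE))

# The same count using only string methods.
by_strip = sum(1 for line in sample.splitlines() if line.strip())

print(by_regex, by_strip)
```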

Next, let us check whether this text is suitable for minors.  In order to do so we search for all four-letter
words that start with `d`, `f`, or `s` and end with `k` or `t`.

In [31]:
set(re.findall(r'\b[dfs]\w{2}[kt]\b', text, flags=re.IGNORECASE))

{'Duck',
 'FOOT',
 'dark',
 'desk',
 'duck',
 'fact',
 'fast',
 'feet',
 'felt',
 'flat',
 'foot',
 'fork',
 'salt',
 'sent',
 'shut',
 'sink',
 'soft',
 'sort',
 'spot',
 'suet',
 'suit'}

How many words are in this text and how many different words are used?

In [32]:
L = re.findall(r'\b\w+\b', text.lower())
print(f'There are {len(L)} words in this book and {len(set(L))} different words.')

There are 27344 words in this book and 2579 different words.
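
A natural next step is to ask which words occur most often.  The following sketch combines `findall` with the class `Counter` from the standard library module `collections`.  It works on a short made-up sentence instead of the book, so the numbers below say nothing about `alice.txt`.

```python
import re
from collections import Counter

# Illustrative sample text, not taken from the book.
sample = 'The cat saw the dog. The dog saw the cat and the bird.'

# Same approach as above: convert to lower case, then extract the words.
words = re.findall(r'\b\w+\b', sample.lower())

# Counter maps every word to the number of its occurrences.
counts = Counter(words)
most_frequent = counts.most_common(1)

print(f'{len(words)} words, {len(counts)} different words')
print(most_frequent)
```

Applied to the book, the same code would be run on the variable `text` instead of `sample`.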
