## What are regular expressions?

Regular expressions are a general way to describe patterns in sequences of characters. A regular expression defines a search pattern that is mainly used for matching strings, for example in “find and replace” operations.

These expressions are largely language-independent:
apart from some minor variations, the same *regular expression* can be used in
programming languages such as Python, Perl or Java, or at least in a very similar form.
Many text editors also support regular expressions, as does, for example, MS Word.




## Why regular expressions?

So far we have learned a few ways to search for a substring in a string.

* The `in` operator checks whether a substring occurs in the string:

In [None]:
firstnames = ['Astrid', 'Ines', 'Christoph', 'Markus', 'Çınar', 
              'Đželila', 'Niklas', 'Anna', 'Stefanie', 'Raphael', 
              'Anna-Lena', 'Silvia', 'Julian', 'Simon', 'Katharina', 
              'Michael', 'Dominik', 'Maria', 'Kevin', 'Bianca', 
              'Thomas', 'Nora', 'Manuel', 'Selina', 'Gabriel', 
              'Daniel', 'Thomas', 'Nina', 'Michael', 'Fabio', 
              'Theresa', 'Manuel', 'Carina', 'Philipp', 'Lukas', 
              'Wolfgang', 'Anna', 'Doris', 'Thomas', 'Muhammed', 
              'Christoph', 'Lisa-Marie', 'Jessica', 'Maria', 
              'Thomas', 'Florian', 'Martin', 'Anna', 'Oliver', 
              'Gregor', 'Helmut', 'Florian', 'Matteo', 'David', 
              'Marlene', 'Vanessa', 'Lea', 'Jan', 'Béla', 'Verena', 
              'Manuel', 'Björn', 'Tobias', 'Denise', 'Emma', 'Lukas', 
              'Sarah', 'Oliver', 'Janine', 'Manuel', 'Georg', 'Lorenz', 
              'Verena', 'Caroline', 'Laura', 'Felix', 'Simon', 'Lea', 
              'Peter', 'Sandra', 'Julia', 'Sophie', 'Jacqueline', 
              'Nina', 'Sebastian', 'David', 'Matthias', 'Patrick', 
              'Selina', 'Fabian', 'Daniel', 'Sabine', 'Josef', 'Lisa', 
              'Carina', 'Florian', 'Fabian', 'Viktoria', 'Christoph', 
              'Emilia']


In [None]:
for firstname in firstnames:
    if 'ie' in firstname:
        print(firstname)

* `startswith()` and `endswith()` test whether a string starts or ends with a substring:

In [None]:
[firstname for firstname in firstnames if firstname.startswith('A')]

In [None]:
[firstname for firstname in firstnames if firstname.endswith('o')]

Regular expressions greatly extend the possibilities of searching for a pattern in a string. They are also not Python-specific, but as mentioned earlier, are available in most programming languages and many text editors.

A regular expression is a pattern written in a special syntax that is applied to a string.

Here's a tip right away: More complex regular expressions are often difficult to understand. Experience shows that the use of interactive or even visualizing regex testers helps here. Here are some recommendations:
* my favorite: https://regex101.com/
* https://pythex.org/ 
* https://www.debuggex.com/

## Regex in Python

In Python, regular expressions are provided via the `re` module from the standard library. Alternatively, there is the somewhat newer `regex` module, which we would have to install via pip or conda first. Here we use the `re` module which is already available everywhere.

In [None]:
import re

The `re` module provides a number of functions, including `search()`, which searches for a pattern in
a string. `search()` expects two arguments: the pattern to search for and the string to which the pattern is applied.

If the pattern is found, `search()` returns a `Match` object:

In [None]:
re.search('de', 'abcdef')

If the pattern is not found, `search()` returns `None` instead of a ``Match`` object: 

In [None]:
print(re.search('xyz', 'abcdef'))

However, we could have achieved the same thing with a simple `'xyz' in 'abcdef'`:

In [None]:
'xyz' in 'abcdef'
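
Because `search()` returns either a `Match` object or `None`, its result can be used directly as a condition, which the `in` operator cannot offer beyond a plain `True`/`False`. The following minimal sketch also reads the matched text and its position from the `Match` object via `group()` and `span()`; the variable names are chosen for illustration only.

```python
import re

m = re.search('de', 'abcdef')
if m:                       # a Match object is truthy, None is falsy
    found = m.group()       # the matched substring
    position = m.span()     # (start, end) indices of the match
    print(found, position)  # de (3, 5)

no_match = re.search('xyz', 'abcdef')
print(no_match is None)     # True
```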

The power of regular expressions only comes from the ability to define more complex patterns.

## Patterns

### Test for beginning of string
Regular expressions use the `^` character (the caret, at the top left of a German keyboard) to mark the beginning of the string. If the first character of the pattern is a `^`, then the pattern following it must occur at the beginning of the string:

In [None]:
re.search('^ab', 'abc')

returns a match object, 

In [None]:
re.search('^ab', 'cab')

on the other hand, does not find the pattern: ``ab`` occurs in the string, but not at the beginning of the string.

### Test for end of string

Regular expressions use the `$` character to mark the end of the string:

In [None]:
re.search('z$', 'xyz')

returns a match object because ``z`` is the last character of the string. A pattern such as `y$`, on the other hand, would not match, because ``y`` is not the last character.
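
To contrast the two anchors, this small sketch tests a few made-up strings against `^ab` and `ab$`. Only the strings that begin (or end) with `ab` are kept.

```python
import re

candidates = ['abc', 'cab', 'ab', 'xaby']   # illustrative test strings

starts_with_ab = [s for s in candidates if re.search('^ab', s)]
ends_with_ab = [s for s in candidates if re.search('ab$', s)]

print(starts_with_ab)   # ['abc', 'ab']
print(ends_with_ab)     # ['cab', 'ab']
```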


<div class="alert alert-block alert-info">
<b>Exercise Regex-1</b>
<p>Things to think about: Which string does the following pattern match?</p>
</div>

In [None]:
re.search('^abc$', '')

### Any character and quantifiers

The dot stands for any character in a regular expression.

In [None]:
re.search('v.n', 'Guido van Rossum')

In [None]:
re.search('v.n', 'Anton von Webern')

Both calls return a match, because the pattern ``v.n`` matches both ``van`` and ``von``.

Each character can be combined with a *quantifier* that specifies how often the character must occur at that position. You should know the following quantifiers:

* `*`: the asterisk means zero or more repetitions.
* `+`: the plus sign means one or more repetitions.
* `?`: the question mark means zero or one repetition.

#### Any number of repetitions (*).

A quantifier can be combined with any character. Here we combine the quantifier `*` with the character `o`, which means: *At this position, the character `o` can appear zero times, once or any number of times.*

In [None]:
import re

for s in ['cl', 'col', 'cool', 'coool']:
    if re.search('co*l', s):
        print(s)

In [None]:
re.search('G.*v', 'Guido van Rossum')

The pattern matches `Guido v` because any number of characters are allowed between the `G` and the `v`. 
However, any number includes *none*, as we can see from this example:

In [None]:
re.search('R.*o', 'Guido van Rossum')

Although the `o` in `Rossum` immediately follows the `R`, so that there is no other character in between, our pattern matches.

#### One or more repetitions (+).
If we use the quantifier `+` (one or more repetitions) instead of `*`, the pattern no longer matches `Ro`, because ``+`` requires at least one character at this position:

In [None]:
print(re.search('R.+o', 'Guido van Rossum'))

This pattern, on the other hand, is still found:

In [None]:
re.search('G.+v', 'Guido van Rossum')

#### Zero or one repetition (?)
The following pattern (`R.?s`) matches e.g. `Ros` or `Rus`, and also `Rs`, but not `Roos`, because the question mark stands for zero or one repetition.

In [None]:
for s in ['Ros', 'Rus', 'Rxs', 'Rs', 'Roos']:
    if re.search('R.?s', s):
        print(s)

So here we are looking for substrings in which there is at most one (arbitrary) character between ``R`` and ``s``.

#### An interval of repetitions
It is also possible to specify in the pattern that, for example, 1, 2 or 3 repetitions of the character are allowed. For this purpose, two numbers (min, max) are written between curly braces:

In [None]:
for s in ['Rs', 'Ras', 'Rees', 'Riiis', 'Roooos']:
    if re.search('R.{1,3}s', s):
        print(s)
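
The brace quantifier also has two shorter forms that are not used above but are part of Python's `re` syntax: `{n}` for exactly *n* repetitions and `{n,}` for at least *n*. A minimal sketch, reusing the strings from the example above:

```python
import re

words = ['Rs', 'Ras', 'Rees', 'Riiis', 'Roooos']

# exactly 2 arbitrary characters between R and s
exactly_two = [s for s in words if re.search('R.{2}s', s)]
# 2 or more arbitrary characters between R and s
at_least_two = [s for s in words if re.search('R.{2,}s', s)]

print(exactly_two)    # ['Rees']
print(at_least_two)   # ['Rees', 'Riiis', 'Roooos']
```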

Once again as a reminder: quantifiers can be used not only together with the placeholder character `.`, but with all characters or character classes:

In [None]:
for s in ['Pr', 'Par', 'Pur', 'Paar', 'Paur', 'Paaar']:
    if re.search('a+', s):
        print(s)

**Exercise**: try the above example with other quantifiers, too!

## Patterns are greedy
By default, a quantifier matches as much text as possible; we say it is *greedy*. This becomes clearer if we use `re.findall()` instead of `re.search()` in the next example. `findall()` does not return a match object but a list of all substrings that the pattern matches; the rules for the pattern are the same.

In [None]:
re.findall('G.*o', 'Guido van Rossum')

The substring found is not, as one might expect, `Guido`, but the longest string that the pattern matches: `Guido van Ro`. 
Since this is not always desired, we can mark a quantifier as *non-greedy* by a trailing question mark:

In [None]:
re.findall('G.*?o', 'Guido van Rossum')

So the ``?`` after the quantifier ``*`` means: stop as soon as the pattern matches (i.e. do not try to extend the match any further).
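
The difference becomes more visible when a pattern can match several times in the same string. In this sketch (the HTML-like sample text is made up), the greedy version swallows everything from the first `<` to the last `>`, while the non-greedy version stops at the first closing bracket each time:

```python
import re

text = '<b>bold</b> and <i>italic</i>'

greedy = re.findall('<.*>', text)
non_greedy = re.findall('<.*?>', text)

print(greedy)       # ['<b>bold</b> and <i>italic</i>']
print(non_greedy)   # ['<b>', '</b>', '<i>', '</i>']
```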

To demonstrate the functionality of `findall()`, another example:

In [None]:
re.findall('.ll.', 'the snow is falling really extraordinarilly upwards ')

### Ignore upper and lower case
Normally, regular expressions are case-sensitive:

In [None]:
re.findall('a.{0,2}', 'Antoinette asked to be alone.')

If we do not want this distinction, we must call the regex function with the flag `re.I` (or: `re.IGNORECASE`):

In [None]:
re.findall('a.{0,2}', 'Antoinette asked to be alone.', re.I)

Now the pattern matches `'Ant'` as well.

## Character classes
So far, we have only defined patterns that use either a specific character or a wildcard for all characters. However, it would often be useful if we could test for only certain characters (i.e., a subset of characters), such as all vowels. Regular expressions provide *character classes* for this purpose. 
All characters inside square brackets are considered members of the character class. Thus, the expression `[aeiou]` matches any vowel. A range of characters can be abbreviated with a hyphen: `[a-z]` matches any lowercase letter from `a` to `z`.

In [None]:
re.findall('[a-z]', 'Antoinette asked to be alone.', re.I)

This also works for digits: `[0-9]`

In [None]:
re.findall('[0-9]', 'Antonia has 1 horse and 2 cats.')
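
Returning to the vowel class `[aeiou]` mentioned above, here is a short sketch that collects and counts the vowels of a sentence. The `re.I` flag makes the class match uppercase vowels too.

```python
import re

sentence = 'Antoinette asked to be alone.'

vowels = re.findall('[aeiou]', sentence, re.I)
print(vowels)        # every vowel as a separate list entry
print(len(vowels))   # 12
```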

<div class="alert alert-block alert-info">
<b>Exercise Regex-2</b>
<p>
Rewrite the following expression so that the result is `['12', '3']`:
</p>
</div>

In [None]:
re.findall('[0-9]', 'Anna has 12 cows horse and 3 dogs.')

In [None]:
re.findall('[a-zA-Z]', 'Anton prefers to work alone.')

### Predefined character classes

Some character classes are predefined and can make our life a lot easier:

   * `\s` stands for whitespace (spaces, tabs,
     line breaks, etc.)
   * `\S` matches any non-whitespace character. (This
     inversion by capitalization works
     for all classes presented here.)
   * `\b` matches a word boundary and allows you to perform “whole words only” searches.
   * `\d` stands for a decimal digit, i.e. a digit in any
     script defined in Unicode (see
     http://www.fileformat.info/info/unicode/category/Nd/list.htm).
   * `\w` stands for a "word character". In ASCII this is equivalent
     to `[a-zA-Z0-9_]`.

Since some backslash combinations (like `\b`) already have a special meaning at string level (remember ``\n``?), when used in a regular expression we must either escape them with a second backslash or write the expression as a *raw string* (denoted by putting an `r` in front of the string). This example doesn't work:

In [None]:
re.findall('\b[A-Z]\w*', 'Anton has 1 horse and 2 cats.')

This one, on the other hand, does:

In [None]:
re.findall('\\b[A-Z]\\w*', 'Anton has 1 dog called Popito and 2 cats, Toulouse and Marie.')

In general, you should make a habit of always writing patterns as *raw strings* by putting an `r` in front of the expression:

In [None]:
re.findall(r'\b[A-Z]\w*', 'Anton has 1 dog called Popito and 2 cats, Toulouse and Marie.')
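
To see what the `r` prefix actually changes, one can compare the two string literals directly. In a normal string, `'\b'` is a single backspace character, whereas the raw string `r'\b'` consists of the two characters backslash and `b`, which is what the regex engine needs to see:

```python
normal = '\b'     # Python turns this into one backspace character (ASCII 8)
raw = r'\b'       # backslash and 'b' are kept as two separate characters

print(len(normal), len(raw))   # 1 2
print(normal == '\x08')        # True
print(raw == '\\b')            # True
```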

We can apply this new knowledge, for example, to find all words in which two vowels follow each other:

In [None]:
s = "The green sea was the colour of her hair and coat clear as rain. "
re.findall(r'\b\w*[aeiou]{2}\w*', s)

Once again as a reminder: `\b` stands for a word boundary, which we need here to split the text into words; `\w` stands for a word character (``[a-zA-Z0-9_]``), of which any number may occur in the word before and after the vowel pair we are looking for.
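
The other predefined classes from the list above are used in the same way. A minimal sketch with a made-up sentence, showing `\d`, `\S` and `\s`:

```python
import re

s = 'Anton has 1 horse\tand 25 cats.'

numbers = re.findall(r'\d+', s)   # runs of digits
tokens = re.findall(r'\S+', s)    # runs of non-whitespace characters
gaps = re.findall(r'\s', s)       # every single whitespace character

print(numbers)     # ['1', '25']
print(tokens)      # ['Anton', 'has', '1', 'horse', 'and', '25', 'cats.']
print(len(gaps))   # 6 (five spaces and one tab)
```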

### Alternatives

A character class always represents a single character in the string. If we want to search for one of several character combinations, we need **alternatives** instead of character classes:

In [None]:
s = 'Anton has 1 horse, 2 dogs and 3 cats'
re.findall(r'cats|horse|dogs', s)

If we want to search independently of singular or plural, we can use the ``?`` quantifier. Remember: ``?`` means that the character in front of it may occur once or not at all. ``horses?``
thus matches both `horse` and `horses`. This allows us to make our pattern more flexible:

In [None]:
s = 'Anton has 1 horse, 2 dogs and 3 cats'
re.findall(r'cats?|horses?|dogs?', s)

We can even have the number of animals included if we assume that the number precedes the animal and the number is followed by one or more whitespace characters:

In [None]:
re.findall(r'\d+\s+horses?|\d+\s+dogs?|\d+\s+cats?', s)

## Further functions of the re module
So far we have only seen two functions of the re module: `search()` and `findall()`. But there are some other very useful functions.

### match()

`match()` behaves like `search()`, but the pattern is only searched for at the beginning of the string. `match('abc', s)` corresponds to `search('^abc', s)`.

In [None]:
re.match('ab', 'abcdef')

In [None]:
re.match('ab', '-abcdef')
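
The correspondence between `match()` and an anchored `search()` can be checked directly. This sketch compares both calls on the two strings used above; the `results` list exists only for illustration.

```python
import re

results = []
for s in ['abcdef', '-abcdef']:
    via_match = re.match('ab', s)        # only looks at the start of s
    via_search = re.search('^ab', s)     # anchored search, same effect here
    results.append((s, bool(via_match), bool(via_search)))

print(results)   # [('abcdef', True, True), ('-abcdef', False, False)]
```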

### split()
We've already seen the String object's `split()` method:

In [None]:
'1,2,3,4'.split(',')

`re` provides a similar `split()` function, where we can specify the delimiter (or delimiter string) as
a regular expression. This makes it much more powerful.

For example, if we want to break up text into sentences, we have to separate the text at each of the sentence-ending punctuation marks `.`, `!` and `?`. This is easy with `re.split()`:

In [None]:
with open('data/NASA.txt') as fh:
    text = fh.read()
sentences = re.split(r'[.?!]\s*', text)
print(len(sentences))

To clarify, the ``\s*`` at the end of the pattern ensures that whitespace is stripped between sentences.
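
Because the example above depends on a file, here is the same pattern applied to a short made-up string. Note the empty string at the end of the result: the text ends with a delimiter, so `re.split()` also returns what follows it, which is nothing.

```python
import re

text = 'Hello world! How are you?  Fine.'
sentences = re.split(r'[.?!]\s*', text)
print(sentences)   # ['Hello world', 'How are you', 'Fine', '']
```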

*Note*: This process of breaking down text into small units, such as sentences or words, is called tokenization. This can be done with regular expressions for simple cases, but specialized functions such as those provided by packages for processing natural languages (e.g. NLTK) are better.

### sub()

This function corresponds to the `replace()` method of the string object. However, the substring to be replaced can be a pattern here.
For example, we could normalize all whitespace in a string, converting runs of multiple spaces, tabs, newlines, etc. to a single space.

In [None]:
with open('data/NASA.txt') as fh:
    text = fh.read()
text = re.sub(r'\s+', ' ', text)
sentences = re.split(r'[.?!]\s*', text)
print(sentences[:2])
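
The whitespace normalization can also be tried without the file. A minimal sketch on a made-up string containing a double space, a tab and a newline:

```python
import re

messy = 'one  two\tthree\n four'
clean = re.sub(r'\s+', ' ', messy)   # each run of whitespace becomes one space
print(clean)   # one two three four
```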

Now we can, for example, calculate the mean sentence length (in characters):

In [None]:
# Just in case, we remove 0-length sentences
sentences = [s for s in sentences if len(s) > 0]
sum([len(s) for s in sentences]) / len(sentences)

Or the average number of words per sentence, the shortest and the longest sentence:

In [None]:
words_per_sentence = [len(s.split()) for s in sentences]
print('Average number of words per sentence:', sum(words_per_sentence) / len(sentences))
print('The longest sentence has {} words, the shortest {}'.format(max(words_per_sentence), min(words_per_sentence)))