# Regular expressions: substitution and split

by Koenraad De Smedt at UiB

---
This tutorial is a continuation of the one about regex search. It demonstrates the following:

1.  Substitution of matches
2.  Anchors
3.  Lookbehind and lookahead
4.  Splitting of strings at regex matches
5.  Use of flags and character classes.

These are basic techniques for manipulating string patterns; more advanced techniques are not covered here. To learn more about regex in Python, see the [documentation](https://docs.python.org/3/library/re.html) and the [Python regular expression howto](https://docs.python.org/3/howto/regex.html).

---

In [None]:
import re

juliet = '''My bounty is as boundless as the sea,
My love as deep; the more I give to thee,
The more I have, for both are infinite.'''

## Substitution

The `re.sub` function replaces the parts of a string that match a regular expression. By default, all matches are replaced.

In [None]:
print(re.sub('My', 'YOUR', juliet))

The optional `count` argument limits the number of substitutions. It is best passed as a keyword, since it comes after the pattern, the replacement and the string.

In [None]:
print(re.sub('My', 'YOUR', juliet, count=1))

If the dot needs to match newlines, the flag `re.DOTALL` must be specified. Because `*` is greedy, the match extends as far as possible, here to the end of the string.

In [None]:
print(re.sub('sea.*', 'ocean.', juliet, flags=re.DOTALL))
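To see what `re.DOTALL` changes, the following sketch runs the same substitution with and without the flag. Without it, `.*` stops at the first newline, so only the rest of the first line is replaced. The variable names `without_flag` and `with_flag` are introduced here for illustration only.

```python
import re

juliet = '''My bounty is as boundless as the sea,
My love as deep; the more I give to thee,
The more I have, for both are infinite.'''

# Without re.DOTALL the dot does not match a newline,
# so the greedy .* stops at the end of the first line.
without_flag = re.sub('sea.*', 'ocean.', juliet)

# With re.DOTALL the dot also matches newlines,
# so the greedy .* runs to the end of the string.
with_flag = re.sub('sea.*', 'ocean.', juliet, flags=re.DOTALL)

print(without_flag)
print('---')
print(with_flag)
```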

### Substitutions for groups

Groups marked with parentheses in the search pattern can be referred to in the substitution by index, written with a double backslash: `\\1`, `\\2`, and so on. The double backslash is necessary because in an ordinary Python string literal a single backslash starts an escape sequence, so `'\1'` would be read as a control character rather than as a group reference.


In [None]:
print(re.sub('as (\\w+) as', 'more \\1 than', juliet))

Alternatively, use a [raw string, prefixed with `r`](https://docs.python.org/3/library/re.html#raw-string-notation), and write a single backslash. Raw strings are also the safest way to write patterns that contain backslashes, such as `\w`.

In [None]:
print(re.sub(r'as (\w+) as', r'more \1 than', juliet))

The following illustrates how groups can be swapped.

In [None]:
print(re.sub(r'(\d+) (EUR|NOK|USD)', r'\2 \1', '1100 NOK is 100 EUR'))
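When a pattern has several groups, numbered references can become hard to read. As an aside that goes slightly beyond this tutorial, Python's `re` module also supports named groups, written `(?P<name>...)` in the pattern and `\g<name>` in the substitution. The group names `amount` and `currency` below are chosen for illustration only.

```python
import re

text = '1100 NOK is 100 EUR'

# Each group gets a name instead of being referred to by position.
pattern = r'(?P<amount>\d+) (?P<currency>EUR|NOK|USD)'

# \g<name> inserts the text matched by the named group.
swapped = re.sub(pattern, r'\g<currency> \g<amount>', text)
print(swapped)
```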

### Anchors

One can use `$` (dollar) to *anchor* the regex at the end of the string. The following deletes the final word followed by any final non-word characters.

In [None]:
print(re.sub(r'\W\w+\W*$', '', juliet))

Similarly, a `^` (caret or circumflex) is an anchor at the beginning of the string.

In [None]:
print(re.sub(r'^My \w+', 'YOUR GAIN', juliet))
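By default, `^` anchors at the beginning of the whole string, not at the beginning of each line. The sketch below contrasts this with the `re.MULTILINE` flag, which this tutorial does not otherwise cover; with that flag, `^` also matches right after each newline. The variable names are introduced for illustration only.

```python
import re

juliet = '''My bounty is as boundless as the sea,
My love as deep; the more I give to thee,
The more I have, for both are infinite.'''

# Default: ^ matches only at the very start of the string.
whole_string = re.sub(r'^My', 'YOUR', juliet)

# With re.MULTILINE, ^ also matches at the start of every line.
each_line = re.sub(r'^My', 'YOUR', juliet, flags=re.MULTILINE)

print(whole_string)
print('---')
print(each_line)
```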

### Lookbehind and lookahead (optional)

Substitution is performed only on non-overlapping matches. Consider the following simplified rule for intervocalic voicing of a fricative. After matching `'ofi'`, this part of the string has been consumed, so that `'ifa'` will not match and the second `f` is left unchanged.

In [None]:
re.sub('([aio])f([aio])', r'\1v\2', 'xofifan')

A possible workaround is to check for context before and/or after a match without making that context part of the match. In the following, `(?<=...)` is a lookbehind that checks the left context and `(?=...)` is a lookahead that checks the right context, while only `f` is matched and replaced.

In [None]:
re.sub('(?<=[aio])f(?=[aio])', r'v', 'xofifan')
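The two approaches can be compared side by side. This is a minimal sketch using the same patterns and test word as above; the variable names `consuming` and `lookaround` are introduced for illustration only.

```python
import re

word = 'xofifan'

# The vowels are part of the match, so 'i' is consumed by the
# first match and cannot serve as left context for the second f.
consuming = re.sub(r'([aio])f([aio])', r'\1v\2', word)

# The vowels are only checked, not consumed, so both f's qualify.
lookaround = re.sub(r'(?<=[aio])f(?=[aio])', 'v', word)

print(consuming)   # xovifan
print(lookaround)  # xovivan
```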

### Flags

Regex matching is case-sensitive by default. Adding the `re.I` flag (short for `re.IGNORECASE`) makes the matching ignore case.

In [None]:
print(re.sub('the', 'THE', juliet, flags=re.I))

Multiple flags can be combined with the vertical bar `|`.

In [None]:
print(re.sub('SEA.*', 'ocean.', juliet, flags=re.I|re.DOTALL))

### Special characters

By now, it should be clear that several characters have special meanings in regular expressions:

> `. * + [ ] ( ) | ? ^ $ \`

The following also have special meanings; see the [chapter on regular expressions by Jurafsky & Martin](https://web.stanford.edu/~jurafsky/slp3/2.pdf). I will not give examples here.

> `{ }`

As mentioned before, special characters must be escaped in regex search strings if they are to be taken literally, as in the following, which replaces periods with exclamation marks.

In [None]:
print(re.sub(r'\.', '!', juliet))
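The difference an escape makes can be checked on a short string. The sketch also uses `re.escape`, a standard-library helper not otherwise covered in this tutorial, which escapes the special characters in a string that should be matched literally. The sample string `'a.b'` and the variable names are made up for illustration.

```python
import re

sample = 'a.b'

# Unescaped, the dot matches any character (except a newline).
unescaped = re.sub('.', '!', sample)

# Escaped, the dot matches only a literal period.
escaped = re.sub(r'\.', '!', sample)

# re.escape builds the escaped pattern from the literal text.
via_helper = re.sub(re.escape('.'), '!', sample)

print(unescaped)   # !!!
print(escaped)     # a!b
print(via_helper)  # a!b
```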

## Split

A string can be split with a given regular expression. This results in a list of strings. The following splits a string at punctuation and/or newlines. Notice that since a matching period occurs at the end of the string, the split there results in an empty string.

In [None]:
print(re.split('[,;.!?\n]+', juliet))
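If the empty string and the leading spaces in the result are unwanted, one possible cleanup is a list comprehension that strips each part and drops empty ones. This is only a sketch of one way to do it; the names `parts` and `cleaned` are introduced for illustration.

```python
import re

juliet = '''My bounty is as boundless as the sea,
My love as deep; the more I give to thee,
The more I have, for both are infinite.'''

# Split at runs of punctuation and/or newlines, as above.
parts = re.split('[,;.!?\n]+', juliet)

# Strip surrounding whitespace and discard empty strings.
cleaned = [part.strip() for part in parts if part.strip()]

print(parts)
print(cleaned)
```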

### Exercises

1.  Split a text at non-word characters.
2.  Use a regex to omit all vowels from a string.
3.  Expand English contractions with *n’t* in a text. For instance, replace *n’t* at the end of a word by a space followed by *not*, so that *don’t* becomes *do not* and *doesn’t* becomes *does not*. Note that some other contractions, such as *won’t* and *can’t*, must be handled separately.
4.  Revisit the notebook on *Palindromes*. Write a new version of a palindrome detector which disregards everything which is not a digit or letter (replace with nothing). Here is a little palindrome for testing:
`'''A man, a plan, a canoe, pasta, heros, rajahs, a coloratura, maps, snipe, percale, macaroni, a gag, a banana bag, a tan, a can, a tag, a banana bag again (or a camel), a crepe, pins, Spam, a rut, a Rolo, cash, a jar, sore hats, a peon, a canal - Panama!'''`
5.  (optional) Below is a string of nucleotides. Suppose we want to insert spaces between *codons*, i.e. groups of three nucleotides. The following does that, but also inserts an unwanted space at the end. Fix this detail by means of lookahead: only match a group of three if the group is followed by a character.

In [None]:
re.sub('(...)', r'\1 ', 'ATGTATAACGTGGCGTAAGCGTACGCTATAGCCTGA')