# Regular expression search (1)

## for finding patterns in strings

by Koenraad De Smedt at UiB

---
Searching for patterns is a basic need in data science.

Strings can be searched and manipulated by means of a regular expression (*RE* or *regex*), which expresses a character pattern such as *a sequence of four alphanumeric characters followed by an exclamation mark or a question mark*.

This notebook demonstrates how to:

1.   Search for a match to a pattern in a string
2.   Use constructions for alternatives, optionality and wildcards.

For more information on regular expressions and their use for NLP, read ➜ Jurafsky & Martin, *Speech and Language Processing, 3rd ed.*, Ch. 2: [Regular Expressions, Text Normalization, Edit Distance](https://web.stanford.edu/~jurafsky/slp3/2.pdf). Note, however, that some conventions are system-dependent. Jurafsky & Martin use slashes to delimit regular expressions, but in Python they are simply strings.

See also the [documentation of Python regular expression operations](https://docs.python.org/3/library/re.html) and the [Python regular expression howto](https://docs.python.org/3/howto/regex.html).

---

In order to use regular expressions in Python, we import the `re` module. Let's also make an example string in which we will search for various patterns.

In [None]:
import re

verses = '''My bounty is as boundless as the sea,
My love as deep; the more I give to thee,
The more I have, for both are infinite.'''

## Search for a pattern match

The `re.search` function searches for a regular expression in a string. It returns a match object containing the start and end points of the *first* match and the matching part of the string. If no match is found, the function returns `None`. The simplest regex is a string of regular characters. Note that the search is case-sensitive by default.

In [None]:
print(re.search('My love', verses))
print(re.search('my love', verses))
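
The parts of a match object can also be read out individually. The following is a minimal sketch using the standard `group`, `start`, `end` and `span` methods of match objects; the variable names `match` and `no_match` are chosen here only for illustration.

```python
import re

verses = '''My bounty is as boundless as the sea,
My love as deep; the more I give to thee,
The more I have, for both are infinite.'''

match = re.search('My love', verses)
if match:                               # a match object counts as true, None as false
    print(match.group())                # the matching part of the string
    print(match.start(), match.end())   # start index and end index (end is exclusive)
    print(match.span())                 # the same two indices as a tuple

no_match = re.search('my love', verses) # lowercase 'my' does not occur in the text
print(no_match is None)                 # True
```

Testing the result before calling its methods avoids an error when the search returns `None`.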

### Alternatives in patterns

Some characters have a special meaning in a regular expression. The vertical bar `|` indicates alternatives (a disjunction). The following looks for either of two alternative strings and the search is successful as soon as one of them is found.

In [None]:
print(re.search('My love|My bounty', verses))

Use parentheses for making a group. The following makes a group of two alternatives. It is equivalent to the previous search, but more compact.

In [None]:
print(re.search('My (love|bounty)', verses))
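
A group made with parentheses also records which alternative was matched. The sketch below assumes nothing beyond the standard `group` method of match objects; `match` is just an illustrative variable name. It also shows that the first match in the *string* is returned, regardless of the order in which the alternatives are listed in the pattern.

```python
import re

verses = '''My bounty is as boundless as the sea,
My love as deep; the more I give to thee,
The more I have, for both are infinite.'''

match = re.search('My (love|bounty)', verses)
print(match.group(0))   # the whole match: 'My bounty'
print(match.group(1))   # the part matched by the group: 'bounty'
```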

Square brackets contain a set of alternative *characters*. For example, the regex `'[Aa] '` matches an *A* or *a* followed by a space. The search below uses `'[Mm]y '`, which matches *My* or *my* followed by a space. The first match is returned.

In [None]:
print(re.search('[Mm]y ', verses))

When there is no match, `None` is returned.

In [None]:
print(re.search('[Xx]y ', verses)) # will not be found

Between square brackets, the hyphen indicates a *span* of alternative characters. The following looks for a character in the range *A* to and including *Z* or *a* to and including *z*.

In [None]:
print(re.search('[A-Za-z]y ', verses))

### Optionality and wildcards

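The question mark indicates that the preceding character or group is optional. The question mark is *greedy*: the optional part is included in the match whenever it is present at that position. Note, however, that the search still returns the *first* position in the string where the pattern matches, which is not necessarily the longest possible match in the text. In the second search below, the first match is the *bound* in *bounty*, not *boundless*.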

In [None]:
print(re.search('(in)?finite', verses))
print(re.search('bound(less)? ?', verses))

A dot (period) in a regex stands for an arbitrary character, except a newline.

In [None]:
print(re.search('.s', verses))

The Kleene star `*` matches the previous expression 0 or more times.

In [None]:
print(re.search(' as ..* as ', verses))

The plus sign `+` matches the previous expression 1 or more times.

In [None]:
print(re.search(' as .+ as ', verses))
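
The difference between `*` and `+` shows up when the repeated expression may be absent. The patterns `'thee*'` and `'thee+'` below are illustrative examples added here, not taken from the cells above; `star` and `plus` are arbitrary variable names.

```python
import re

verses = '''My bounty is as boundless as the sea,
My love as deep; the more I give to thee,
The more I have, for both are infinite.'''

# 'thee*' means 'the' followed by zero or more extra 'e'
star = re.search('thee*', verses)
print(star.group())   # 'the' (from 'the sea' in the first line)

# 'thee+' means 'the' followed by at least one extra 'e'
plus = re.search('thee+', verses)
print(plus.group())   # 'thee' (from the second line)
```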

### Character classes

If you want to make sure to match words or non-words, use character classes.

`\w` is a word character (alphanumeric or underscore).

`\W` is a non-word character (the opposite of `\w`).

So, the following looks for two occurrences of *as*, each delimited by non-word characters, with a single word in between them.

In [None]:
print(re.search(r'\Was\W\w+\Was\W', verses))  # raw string keeps the backslashes intact
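
The character classes can also be tried out separately. The following sketch writes the patterns as raw strings; the variable names `word`, `non_word` and `framed` are illustrative only.

```python
import re

verses = '''My bounty is as boundless as the sea,
My love as deep; the more I give to thee,
The more I have, for both are infinite.'''

word = re.search(r'\w+', verses)       # first run of word characters
print(repr(word.group()))              # 'My'

non_word = re.search(r'\W+', verses)   # first run of non-word characters
print(repr(non_word.group()))          # ' ' (the space after 'My')

framed = re.search(r'\Was\W\w+\Was\W', verses)
print(repr(framed.group()))            # ' as boundless as '
```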

If you want to match a literal period, use `\.`

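Note: The backslash also has a meaning in ordinary Python string literals, which can interfere with its meaning in a regular expression. It is therefore common to write patterns as [*raw* strings preceded by `r`](https://docs.python.org/3/library/re.html#raw-string-notation). Escapes such as `\W` and `\.` happen to pass through ordinary strings unchanged, but recent Python versions warn about such unrecognized escape sequences, and other escapes mean something different in an ordinary string. Raw strings matter especially in substitutions (see the notebook on Regex substitution).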

In [None]:
print(re.search(r'in.*\.', verses))
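
The effect of raw string notation can be made visible with an escape that Python itself interprets. The sketch below uses `\b`, which means *backspace* in an ordinary string but *word boundary* in a regular expression. The word boundary is not otherwise covered in this notebook and is used here only as an illustration; `plain` and `raw` are arbitrary names.

```python
import re

verses = '''My bounty is as boundless as the sea,
My love as deep; the more I give to thee,
The more I have, for both are infinite.'''

print(len('\b'), len(r'\b'))          # 1 2: one backspace character vs. backslash + b

plain = re.search('\bas\b', verses)   # looks for backspace, 'as', backspace
print(plain)                          # None

raw = re.search(r'\bas\b', verses)    # looks for 'as' as a whole word
print(raw.group(), raw.span())        # as (13, 15)
```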

Between square brackets, the period and most other special characters do not need to be escaped. The following finds the first sequence of one or more characters that are each a period, semicolon, exclamation mark or question mark.

In [None]:
punctuation = '[.;!?]+'
print(re.search(punctuation, verses))
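
The different behaviour of the dot inside and outside square brackets can be checked on a short test string. In this sketch, the string `'abc'` and the variable names are made up for illustration.

```python
import re

verses = '''My bounty is as boundless as the sea,
My love as deep; the more I give to thee,
The more I have, for both are infinite.'''

punctuation = '[.;!?]+'
first = re.search(punctuation, verses)
print(first.group())                    # ';' (the comma is not in the set)

wildcard = re.search('.', 'abc')        # outside brackets, the dot matches any character
print(wildcard.group())                 # 'a'

literal = re.search('[.]', 'abc')       # inside brackets, the dot only matches a period
print(literal)                          # None
```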

### Matching newlines

Normally, the dot (wildcard) does not match newlines.

In [None]:
print(re.search(',.+,', verses))

If you want the dot to match newlines as well, pass `re.DOTALL` as an extra argument. Note that `+` is greedy, so the longest possible match from the starting point is returned: it extends to the last comma in the text.

In [None]:
print(re.search(';.+,', verses, re.DOTALL))  # the comma needs no escaping
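
The effect of `re.DOTALL` is easiest to see by running the same pattern with and without it. This is a small sketch; `without_flag` and `with_flag` are illustrative names.

```python
import re

verses = '''My bounty is as boundless as the sea,
My love as deep; the more I give to thee,
The more I have, for both are infinite.'''

# No single line contains two commas, so this fails without the flag
without_flag = re.search(',.+,', verses)
print(without_flag)                     # None

# With re.DOTALL the dot also matches '\n', so the match can span lines
with_flag = re.search(',.+,', verses, re.DOTALL)
print(repr(with_flag.group()))          # from the first comma to the last comma
```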

### Exercises

1.  Check if the text contains two vowels after each other.
2.  Check if the text contains a comma followed by a space or newline.