# Regular Expressions (Regex or Regexp)

To the uninitiated, the phrase _regular expression_ may sound as though it should mean "common phrase." In programming, however, it refers to string search queries and substitution commands following an agreed-upon set of rules for the encoding of patterns within the data (_regular_ deriving as it does from Latin _regula_ "rule").

A basic set of such rules is shared by virtually all regex-capable software, but their _implementation_ differs a little from one context to the next. The two differences you will run into most often are:

- Whether the environment follows the original POSIX standard (and within it, whether it understands basic regular expressions [BRE] or extended regular expressions [ERE] as well), or Perl-compatible regular expressions (PCRE);
- Which special characters to escape (by prefixing them with a backslash).

Python's stock regex library, `re`, closely resembles PCRE in its syntax, but is not fully compatible with it. Some of its behaviour also changed in Python 3.11, so results may differ if older libraries such as CLTK 1.5 tie you to an earlier Python version.
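To make the point about escaping concrete, here is a minimal sketch in Python's `re`. The sample string is invented for illustration and does not come from any corpus.

```python
import re

# Hypothetical sample text, invented for this illustration:
text = "Price: 3.50 (approx.) or 3x50?"

# Unescaped, "." matches any character, so "3x50" is a hit as well:
loose = re.findall(r"3.50", text)
print(loose)    # ['3.50', '3x50']

# Prefixed with a backslash, "." only matches a literal full stop:
strict = re.findall(r"3\.50", text)
print(strict)   # ['3.50']

# re.escape() adds the backslashes for us when we want to search
# for a string exactly as written, brackets and all:
literal = re.escape("(approx.)")
bracketed = re.findall(literal, text)
print(bracketed)  # ['(approx.)']
```

Writing patterns as raw strings (`r"..."`) keeps Python itself from interpreting the backslashes before `re` gets to see them.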

The official documentation of the `re` library may be found [here](https://docs.python.org/3/library/re.html). The Real Python regex tutorial starts [here](https://realpython.com/regex-python/), but that first part focuses rather heavily on `re.search()`, which you are unlikely to use much, so you'll want to move on to [part 2](https://realpython.com/regex-python-part-2/) fairly soon.

If you are a Mac or Linux user, you can use POSIX regular expressions on the command line with `grep` (BRE by default, ERE with `grep -E`); be sure to consult the manual (`man grep`). Windows offers similar functionality in PowerShell with the `Select-String` command, and in `cmd` with `findstr`. Once you know the basics, you can also use limited regular expressions in the search-and-replace functions built into word processors and the like.

It would be inefficient to duplicate here the many great tutorials on Python `re`. Instead, I will use this notebook to make a few important points and give practical examples, which I may add to in future.

In [1]:
import glob
import os
# re being a stock Python library, you'll never need to install it, just import:
import re
# We'll also load a Git wrapper so we can download a corpus to work with:
from git import Repo

Regular expressions are always applied to strings, not e.g. lists of tokens, so let's retrieve a document and open it as a single string:

In [2]:
# We'll repeat the conditional clone routine from corpora.ipynb so we'll have a corpus to work with:
remote = 'https://github.com/ECHOEProject/echoe.git'
local = 'echoe'
# Only clone if the target folder doesn't already exist:
if not os.path.exists(local):
    repo = Repo.clone_from(remote, local)
# Else, just update the working copy from remote:
else:
    repo = Repo(local)
    assert isinstance(repo, Repo)
    repo.remotes.origin.pull()
assert not repo.bare

# And we'll open an arbitrary document to run our queries on:
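# (The text contains characters such as þ and ð, so we state the encoding explicitly.)
with open('echoe/plaintext/344.05.txt', encoding='utf-8') as file: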
    doc = ''
    lines = file.readlines()
    for line in lines:
        if ': ' in line:
            line = line.split(': ')[1].rstrip()
        doc = doc + ' ' + line
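The reason for building a single string can be seen in a small sketch: `re` functions reject a list outright. The token list below is a hypothetical toy example, not taken from the corpus, and the two workarounds are only two of several possible approaches.

```python
import re

# Hypothetical toy tokens, invented for this illustration:
tokens = ["guðlac", "spoke", "of", "guþlaces", "life"]
pattern = r"gu[þð]lac\w*"

# Passing the list itself to re.findall() raises a TypeError:
raised = False
try:
    re.findall(pattern, tokens)
except TypeError:
    raised = True
print(raised)       # True

# Option 1: join the tokens into one string and search that:
joined_hits = re.findall(pattern, " ".join(tokens))
print(joined_hits)  # ['guðlac', 'guþlaces']

# Option 2: test each token separately; re.fullmatch() only succeeds
# if the whole token fits the pattern:
token_hits = [t for t in tokens if re.fullmatch(pattern, t)]
print(token_hits)   # ['guðlac', 'guþlaces']
```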

The `re` library offers different search functions, and you don't want to get stuck using the wrong one! You'll want to remember the following:

- `re.search()` only returns the first hit!
- `re.match()` only returns a hit if your pattern is matched at the start of the string to be searched!
- So for many NLP tasks, `re.findall()` is the go-to search function;
- But you may find you'll use `re.sub()` most of all, for substituting one pattern with another; note that it returns a new string rather than changing the original in place.

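Whereas `re.search()` and `re.match()` return a `Match` object, i.e. an instance of the class `re.Match` with its own attributes and methods (or `None` if nothing matches), `re.findall()` returns a normal Python list of strings (or of tuples, if the pattern contains more than one capturing group).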
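The differences between these four functions can be sketched on a short string. The sentence below is a made-up toy example, and the variable names are chosen for illustration only.

```python
import re

# Hypothetical toy sentence, invented for this illustration:
text = "then guðlac spoke, and guþlac prayed"
pattern = r"gu[þð]lac"

# re.search() stops at the first hit and returns a Match object:
first = re.search(pattern, text)
print(first.group(), first.span())   # guðlac (5, 11)

# re.match() only looks at the start of the string, so here it finds nothing:
at_start = re.match(pattern, text)
print(at_start)                      # None

# re.findall() collects every hit in a plain list of strings:
all_hits = re.findall(pattern, text)
print(all_hits)                      # ['guðlac', 'guþlac']

# re.sub() returns a new string; the original is left untouched:
normalised = re.sub(pattern, "Guthlac", text)
print(normalised)                    # then Guthlac spoke, and Guthlac prayed
```

Because `re.search()` and `re.match()` may return `None`, it is safest to check the result before calling methods such as `.group()` on it.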

The document we have opened is a life of Saint Guthlac. Finding mentions of Guthlac is complicated by the fact that his name may be spelled with "þ" or "ð", so we'll use a __character class__, bounded by square brackets, to match either: `[þð]`. His name may also be inflected, so we'll use the predefined character class `\w` to match any word character (a letter, a digit, or an underscore; in this text, effectively any letter) and the quantifier `*` to match any number of them, yielding the expression `gu[þð]lac\w*`.

Quantifiers come in different kinds: `*` matches zero or more occurrences (one or more would be `+`), and it is __greedy__, meaning it will consume as many characters as it can. In a form like "guðlaces", each of "guðlac", "guðlace", and "guðlaces" technically satisfies "zero or more", but we want the engine to keep going until it runs into something other than a word character, so greedy behaviour is what we want. If we wanted non-greedy (lazy) behaviour, we could add `?`, as in `gu[þð]lac\w*?`, which would yield the same number of hits, but every hit would read either "guðlac" or "guþlac".

In [3]:
# A raw string (r'...') keeps Python from treating \w as a string escape:
result = re.findall(r'gu[ðþ]lac\w*', doc)
result

['guðlaces',
 'guðlaces',
 'guþlac',
 'guðlac',
 'guðlac',
 'guðlac',
 'guþlac',
 'guðlac',
 'guðlac',
 'guðlaces',
 'guðlac',
 'guðlac',
 'guðlac',
 'guðlac',
 'guðlac',
 'guðlac',
 'guðlac',
 'guðlac',
 'guðlac',
 'guðlac',
 'guðlac',
 'guþlac',
 'guðlaces',
 'guðlac',
 'guðlac',
 'guðlac',
 'guþlac',
 'guðlac',
 'guðlac',
 'guðlaces',
 'guðlac',
 'guðlac',
 'guðlac',
 'guðlac',
 'guðlace',
 'guðlac',
 'guðlaces',
 'guðlace',
 'guðlac',
 'guðlace',
 'guðlaces',
 'guðlac',
 'guðlac',
 'guþlac',
 'guðlac',
 'guþlaces']
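The contrast between greedy and non-greedy quantifiers, and the reach of `\w`, can be checked on a toy string without loading the corpus. Both strings below are invented for this illustration.

```python
import re

# Hypothetical toy string containing three forms of the name:
sample = "guðlaces guþlac guðlace"

# Greedy: \w* consumes as many word characters as it can:
greedy = re.findall(r"gu[þð]lac\w*", sample)
print(greedy)      # ['guðlaces', 'guþlac', 'guðlace']

# Lazy: \w*? consumes as few as it can, which here means none:
lazy = re.findall(r"gu[þð]lac\w*?", sample)
print(lazy)        # ['guðlac', 'guþlac', 'guðlac']

# \w covers letters (including þ and æ), digits, and the underscore,
# but not spaces or punctuation:
word_chars = re.findall(r"\w+", "abc_123 þæt!")
print(word_chars)  # ['abc_123', 'þæt']
```

Both searches return the same number of hits; only the length of each hit differs.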