# Chapter 17: Regular expressions

In [None]:
from re import *

A regular expression (*regex*) is a concise way to define a set of strings.

As a programming technique, regexes became popular in the late 1960s and early 1970s, and they are just as useful today as they were on the original PDP-11 minicomputers at Bell Labs. Unfortunately for the student, many of the tools of this vintage use a painfully terse syntax. For example, a regular expression that matches an e-mail address written in uppercase is:

In [1]:
#^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$

Before delving into the elaborate and baroque syntax of regexes, I need to explain how to use simple regexes in Python. So, for the time being, a regex is a string.

A regex 'matches' another string if that other string starts with the regex. So, the regex "I am the very" will match any string that starts with "I am the very". It would not match "I am not the very model", or "I think I am the very" or "This example is stupid".

We can also 'search' for a regex in another string. The result of searching is the first occurrence of the regex in the other string, which need not be at the beginning. So, searching for the regex "I am the very" would succeed in "I am the very" and in "He thinks I am the very model of excellence", but not in "Huh?".
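The difference between matching and searching can be checked directly. The sketch below uses the module-level `re.match` and `re.search` functions with the strings from the text, and imports `re` explicitly so that it runs on its own; the variable names are only illustrative.

```python
import re

pattern = "I am the very"

# match succeeds only when the string starts with the pattern.
match_at_start = re.match(pattern, "I am the very model of a modern major general")
match_in_middle = re.match(pattern, "He thinks I am the very model of excellence")

# search succeeds when the pattern occurs anywhere in the string.
search_in_middle = re.search(pattern, "He thinks I am the very model of excellence")
search_missing = re.search(pattern, "Huh?")

print(match_at_start is not None)    # True
print(match_in_middle is not None)   # False
print(search_in_middle is not None)  # True
print(search_missing is not None)    # False
```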

Here are the critical functions in the *re* module:

In [2]:
#compile(pattern)

takes the string pattern and returns a regex object that can be used elsewhere.

That regex object can then be used to search and match strings.
For example,

In [4]:
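#isGeneral = compile("I am the very")
#isGeneral.search("He is the very")
#  (returns None: the regex does not occur anywhere in the string)
#isGeneral.match("Who is to say if I am the very model of a model?")
#  (returns None: the string contains the regex but does not start with it)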

match and search return a *MatchObject* if they find a hit, otherwise they return *None*.

Conveniently, all *MatchObjects* evaluate to *True*, while *None* evaluates to *False*. This means we can plop the result of a search or match into an *if* statement, and the body of the *if* will only be executed if the regex was found.
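As a minimal sketch of that idiom, the hypothetical helper `describe` below (not part of the *re* module) puts the result of a search straight into an *if* statement. It imports `re` explicitly so it can run on its own.

```python
import re

isGeneral = re.compile("I am the very")

def describe(line):
    # Hypothetical helper: search returns a match object (truthy) or None (falsy).
    if isGeneral.search(line):
        return "found"
    return "not found"

print(describe("He thinks I am the very model of excellence"))  # found
print(describe("Huh?"))                                         # not found
```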

We can extract the occurrence of the regex in the string using the group method of the *MatchObject*, which returns the match as a string.

Finally, we can find every occurrence of a regex in a string with the *findall* method of *RegexObject*.
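A short sketch of both calls, using a sentence that contains the pattern twice. The variable names are illustrative, and `re` is imported explicitly so the snippet is self-contained.

```python
import re

isGeneral = re.compile("I am the very")
text = "I am the very model of this and I am the very model of that."

hit = isGeneral.search(text)
first = hit.group()                    # the text of the first occurrence
everything = isGeneral.findall(text)   # a list with every occurrence

print(first)        # I am the very
print(everything)   # ['I am the very', 'I am the very']
```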

In [5]:
def regexProcess(regex, string):
    #A convenience function that runs match, search, and findall for a given
    #regex against the specified string.
    print("")
    print("Examining {0:s} against {1:s}".format(string, regex.pattern))
    matches = regex.match(string)
    if matches:
        print("match = {0:s}".format(matches.group()))
    searches = regex.search(string)
    if searches:
        print("search = {0:s}".format(searches.group()))
    allOccurrences = regex.findall(string)
    if allOccurrences:
        #findall returns a list, and a list does not accept the s format code
        #in Python 3, so let format() fall back to str().
        print("findall = {0}".format(allOccurrences))
    
def basicRegexes():
    s1 = "I am the very"
    #s1 is the exact string the regex will match.
    s2 = "I am the very model of a modern major general"
    #s2 starts with s1, it should trigger both match and search.
    s3 = "I could be the very model"
    #s3 will not be matched or found; it does not contain s1 anywhere.
    s4 = "Could it be that I am the very model?"
    #s4 contains s1, but doesn't start with it. It should be found by search,
    #but not match.
    s5 = "Is there no better example?"
    #This shouldn't be found by anything.
    s6 = "I am the very model of this and I am the very model of that."
    #This contains s1 twice. 
    isGeneral = compile("I am the very")
    for target in (s1, s2, s3, s4, s5, s6):
        regexProcess(isGeneral, target)
    
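#Next, the line start and end characters: ^ and $
#Inside a regex, the ^ character indicates the beginning of a line.
#$ indicates the end of a line.
#(By default Python treats the whole string as a single line; the MULTILINE
#flag makes ^ and $ work at every line break.)
#A . matches any character except a newline.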

def lineLimits():
    re1 = compile("^aa")
    #Any string that starts with aa. This doesn't change match, since match
    #only looks at the beginning of the string. 
    re2 = compile("aa$")
    #Lines ending in aa
    re3 = compile("aa")
    #No constraints on where in the line aa is found. (match will still only
    #look at the beginning, of course.)
    re4 = compile("$aa")
    #This will never match anything - it looks for the end of the string,
    #followed by more characters. But there aren't more characters after
    #the end of a string.
    re5 = compile("^a.a")
    #matches a string that starts with a, then any character, then another a.
    re6 = compile("....")
    #matches strings that are at least four characters long.
    re7 = compile("^..$")
    #matches strings that are exactly two characters long.
    strings = ("aa", "abaa", "aaaaa", "baab")
    for regex in (re1, re2, re3, re4, re5, re6, re7):
        for string in strings:
            regexProcess(regex, string)

#Regular expressions are excellent at detecting repetition. The *, +, {}, and ? operators detect various types of repetition.
#
#* matches the preceding character (or group, more on that later) zero or more times.
#+ matches the preceding character one or more times.
#{a,b} matches the preceding character from a to b times.
#{a} matches the preceding character exactly a times.
#? matches the preceding character zero or one times.

def repetition():
    re1 = compile(".*")
    regexProcess(re1, "")
    #Notice the output - search and match return the empty string, but the
    #if statement in regexProcess is still triggered.
    regexProcess(re1, "The contents of this string don't matter!")
    #This will match absolutely anything, including the empty string.
    re2 = compile("Instrument destroyed by .*")
    regexProcess(re2, "Run completed successfully")
    regexProcess(re2, "Instrument destroyed by D. Teffers")
    #This will match a line that indicates your precious instrument was
    #destroyed, and it will also capture the rest of the line, allowing you
    #to identify the perpetrator.

    re3 = compile(".*$.*")
    #Strange one, this. Remember that * can match zero characters, so this
    #matches anything, then the end of the line, then anything. The second
    #anything is guaranteed to be empty, of course.
    re4 = compile("I like you thi+s much")
    regexProcess(re4, "I like you ths much!")
    regexProcess(re4, "I like you thiiiiiis much!")

#Character classes are specified in [brackets]. A character class matches exactly
#one character, which must be a member of that class. A range of characters
#may be specified with a -.

def charClasses():
    re1 = compile("[a-z]*")
    #will match runs of lowercase characters.
    regexProcess(re1, "I am the very model!")
    #Note that this captures several empty matches. We can fix this by
    #switching from * to +:
    re2 = compile("[a-z]+")
    regexProcess(re2, "E. coli is a (common) model organism, I think.")
    #A character class can be inverted by making ^ its first character:
    #[^a-z] matches any one character that is not a lowercase letter.
    #(^ does not mean beginning-of-line inside a character class.)

#\ flips the significance of a character. Metacharacters (like .*?+[]$) become
#literals, and some normal characters (like d, s, and w) take on special meaning.

def backslash():
    #Here's a regex that's supposed to look for prices. They should start with
    #a dollar sign, then have numbers, then a period, then two numbers.
    re1 = compile("$[0-9]*.[0-9]{2}")
    #Unfortunately, it won't work - the dollar matches the end of a line, so
    #this regex will never match. So, we use \ to make the $ a literal, and
    #we do the same to the .
    re2 = compile("\$[0-9]*\.[0-9]{2}")
    #This one will work, but it's very bad style (recent Python versions warn
    #about the unrecognized escapes \$ and \.). More on that later.
    regexProcess(re2, "$1035.65 for a piece of gum?!")
    #\\ makes a literal backslash:
    re3 = compile("\\?[a-z]*")
    #This regex *should* optionally match a \, followed by any number of
    #letters.
    regexProcess(re3, "This text contains a \backslash.")
    #This doesn't work. It should find a search hit, but nothing comes up.
    #Also, Python turned the \b in \backslash into a backspace character, which
    #shows up as a checkmark on my machine.
    #Unfortunately, \ is used in Python to indicate escape sequences, so we
    #must use raw string notation to tell Python to leave the backslashes in
    #place. Just precede a string with the letter r and Python will not
    #attempt to evaluate any escape sequences in it.
    #(Look at re3 when it's printed by regexProcess - it's coming up as \?[a-z]*
    #which means a literal ? followed by any number of lowercase letters.
    #In expanding escape sequences, Python replaced \\ with a literal \, so the
    #regex compiler only got one backslash.)
    re4 = compile(r"\\?[a-z]+")
    regexProcess(re4, r"this text contains a \backslash.")
    #Using \ to make normal characters become special is a great utility.
    #As an example, \d matches any digit. \s matches whitespace, \S matches
    #non-whitespace, and so on. A list may be found in the online regex
    #documentation for Python.

#The | character specifies a choice when it is placed between two regexes.
#That is A|B matches either A or B, where A and B can be arbitrary regexes.
#Multiple regexes can be combined this way, A|B|C will try to match A, then
#B, then C.

#The final, but most important, element of regexes is parentheses.

#Parentheses around a regular expression create a "group". In most situations a group acts as a single character,
#so the repetition operators apply to the whole group.
#So, (abc)* detects repeated occurrences of the string abc.
def groups():
    #                         ATG                                                                                                                                                   TAA                                
    dna = "CGGCTTCCAGTGCCAATATATGGTTAAAGTTTATGCCCCGGCTTCCAGTGCCAATATGAGCGTCGGGTTTGATGTGCTCGGGGCGGCGGTGACACCTGTTGATGGTGCATTGCTCGGAGATGTAGTCACGGTTGAGGCGGCAGAGACATTCAGTCTCAGTACTGGAAACTAAGGCTTCCAGTGCCAATATGAGCG"
    #we will use groups to locate the longest coding region in this DNA.
    #I have marked the start and stop codon for your reference. 
    re1 = compile("ATG([AGTC]{3})*(TAA|TAG|TGA)")
    regexProcess(re1, dna)
    #Notice the result of findall - it's not doing what it did before. Rather,
    #it's showing the matched groups. In the case of the repeated group
    #([AGTC]{3})* it returns the last instance of that group.
    #This is an absolutely fantabulous feature. It means you can extract parts
    #of your regex quickly and easily.

    #A further feature of a MatchObject is that group() can take an argument.
    #This specifies which group you want to see. If you provide no argument,
    #group returns the whole match. 

#Groups can also be referred to inside a regex. \1 refers to the first group, \2 refers to the second, and so on.
def backreferences():
    dna = "CGGCTTCCAGTGCCAATATATGGTTAAAGTTTATGCCCCGGCTTCCAGTGCCAATATGAGCGTCGGGTTTGATGTGCTCGGGGCGGCGGTGACACCTGTTGATGGTGCATTGCTCGGAGATGTAGTCACGGTTGAGGCGGCAGAGACATTCAGTCTCAGTACTGGAAACTAAGGCTTCCAGTGCCAATATGAGCG"
    re1 = compile(r"([AGTC])([AGTC])([AGTC])\3\2\1")
    #This regex detects six-character palindromes. It matches three letters,
    #then those same letters in reverse order.
    regexProcess(re1, dna)
    re2 = compile(r"([AGTC]*).*\1")
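    #This regex is cool - the group is any string of DNA, and the * operator
    #will find the longest string possible. Then .* allows for some number of
    #intervening characters, then the capture group is repeated. So, it finds
    #the longest stretch at the start of the DNA that is repeated later on.
    #(match and search both stop at the leftmost hit, which here begins at the
    #first character, so repeats that begin further along are not reported.)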
    #(Truth in advertising: I inserted a large duplication to show how neat
    # this regex is. You should try it on your favorite sequence!)
    regexProcess(re2, dna)
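
The group argument and the | choice described in the comments above are easier to follow on a short sequence. The sketch below reuses the coding-region regex on a made-up 16-letter string (an assumption for illustration, not real data) and imports `re` explicitly so it runs on its own.

```python
import re

# A short, invented sequence: ATG, two codons (GTT, AAA), then the stop codon TAG.
dna = "CCATGGTTAAATAGCC"
orf = re.compile("ATG([AGTC]{3})*(TAA|TAG|TGA)")

hit = orf.search(dna)
whole = hit.group()          # no argument: the whole match
last_codon = hit.group(1)    # group 1 is repeated, so only its last instance is kept
stop_codon = hit.group(2)    # group 2: whichever of the three alternatives matched
groups_only = orf.findall(dna)  # with groups present, findall returns tuples of groups

print(whole)        # ATGGTTAAATAG
print(last_codon)   # AAA
print(stop_codon)   # TAG
print(groups_only)  # [('AAA', 'TAG')]
```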

Regular expressions have a very different feel from the rest of Python.

This is because they are a separate language that Python happens to support.

Other common tools also rely heavily on regular expressions: command-line utilities on Unix/Linux platforms such as grep and sed, the Perl and Ruby programming languages, and many libraries all use regexes as a common notation for working with text.