# COLX 521 Lecture 5: Regular expressions

* Escaping
* Basic regular expression syntax
* Representing language with regex
* Match objects
* Advanced regular expression techniques

## Escaping


When declaring a string literal, you need to escape a character when it would be misinterpreted as having another meaning in the context. For example, when using "" to indicate a string, any " inside the string would need to be escaped, by putting the escape character \ in front of it. Let's try writing a string literal for *"I can't," he said.*

In [71]:
S1 = "\"I can't,\" he said."
S2 = '"I can\'t," he said.'

In [72]:
S1

'"I can\'t," he said.'

In [64]:
print(S1)
S1

"I can't," he said.


'"I can\'t," he said.'

In [4]:
print(S2)
S2

"I can't," he said.


'"I can\'t," he said.'

Though escaping quote characters can usually be avoided in Python, sooner or later you will need to escape the escape character itself, for example in an (MS-DOS) filename.

In [99]:
#provided code
path1 = "c:\files\videos"
path2 = "c:\\files\\videos"

In [11]:
print(path1)

c:ilesideos


In [12]:
print(path2)

c:\files\videos


There are several characters which can only be expressed using an escape sequence. The most useful of these by far are \n (newline) and \t (tab). Here is a [full list](https://docs.python.org/3/reference/lexical_analysis.html#literals), but you're not likely to come across the rest anymore (many of them are related to original typewriter commands).

In [50]:
#provided code
S1 = "Hi.\nHow are you?"
S2 = "the\t2483"

In [55]:
print(S1)

Hi.
How are you?


In [52]:
print(S2)

the	2483


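There are two ways to avoid having to write some of these escape sequences. One is to use the triple quote syntax, which lets a string literal span several lines (so you don't need to write \n yourself) and is also popular for multi-line comments and docstrings.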

In [17]:
S = '''Hi.
How are you.'''
# my code here
S
# my code here

'Hi.\nHow are you.'

And the other is to use r (for raw) in front of the string literal. You will want to use the latter when writing regular expressions so you don't have to escape everything twice, since regexes also require the use of escape characters (that's why we are talking about them now). To illustrate, suppose we were trying to write a regex to find mentions of a file called notes.txt on the C drive. Without a raw string, each literal backslash in the pattern takes four backslashes:

In [2]:
import re

filename = r"C:\notes.txt"

# my code here
print(re.match("C:\\\\notes\\.txt", filename))
# my code here

<re.Match object; span=(0, 12), match='C:\\notes.txt'>
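
For comparison, here is a minimal sketch of the same pattern written as a raw string, where each literal backslash takes only two backslashes. The names `escaped_pattern`, `raw_pattern` and `literal_pattern` are ours, and `re.escape` is shown as one possible way to build a literal pattern automatically.

```python
import re

filename = r"C:\notes.txt"

# The same regex written as an ordinary string and as a raw string.
escaped_pattern = "C:\\\\notes\\.txt"
raw_pattern = r"C:\\notes\.txt"
print(escaped_pattern == raw_pattern)  # True: both hold the same characters

# The raw-string version matches the filename just like the original.
match = re.match(raw_pattern, filename)
print(match.group())

# re.escape builds a pattern that matches a given string literally.
literal_pattern = re.escape(filename)
print(re.fullmatch(literal_pattern, filename) is not None)  # True
```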


## Basic regular expression syntax

[Regular expressions](https://docs.python.org/3/library/re.html) (regexes) ultimately represent sets of strings. Their syntax can be tricky, so it is worth noting that there are websites (e.g. [this one](https://regex101.com/)) that can help you write correct and accurate regexes.

When you don't use any special regex syntax, a regular expression represents only the string you give it. Note that if you're doing this, there's a better way using string methods, and for efficiency (regex comes with an overhead) you should use those instead. 

In [3]:
regex = "quick"
S = "The quick brown fox jumped over the 2 lazy dogs"
# my code here
re.search(regex, S)
# my code here

<re.Match object; span=(4, 9), match='quick'>
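
As a rough illustration of the string-method alternative mentioned above, the sketch below finds the same fixed substring with the `in` operator and `str.find`. The variable name `start` is ours.

```python
S = "The quick brown fox jumped over the 2 lazy dogs"

# Membership test: is the substring anywhere in S?
print("quick" in S)  # True

# Position of the first occurrence (-1 if absent).
start = S.find("quick")
print(start, start + len("quick"))  # 4 9, the same span as the regex match

print(S.find("slow"))  # -1
```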

There are various ways to indicate groups of possible characters; look just above [here](https://docs.python.org/3/library/re.html#module-contents) in the Python docs for a list. Most useful: "." indicates any character (except a newline, by default), "\w" a word character (a letter, digit, or underscore), "\d" a digit. Caret "^" and "$" don't indicate characters, but rather the beginning and end of a string (or sometimes a line).

In [4]:
#provided code
regex1 = r"qu.\wk"
regex2 = r"\d"
regex3 = r"^dogs"
regex4 = r"dogs$"

In [5]:
re.search(regex1,S)

<re.Match object; span=(4, 9), match='quick'>

In [6]:
re.search(regex2,S)

<re.Match object; span=(36, 37), match='2'>

In [7]:
print(re.search(regex3,S))

None


In [8]:
re.search(regex4,S)

<re.Match object; span=(43, 47), match='dogs'>
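
The "sometimes a line" case can be sketched with the `re.MULTILINE` flag, which makes "^" and "$" match at line boundaries as well. The sample `text` and the variable names are made up for illustration.

```python
import re

text = "dogs bark\ncats meow\ndogs"

# By default, ^ anchors only at the start of the whole string.
default_starts = re.findall(r"^\w+", text)
print(default_starts)  # ['dogs']

# With re.MULTILINE, ^ and $ also match at the start and end of each line.
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
print(line_starts)  # ['dogs', 'cats', 'dogs']

line_ends = re.findall(r"\w+$", text, flags=re.MULTILINE)
print(line_ends)  # ['bark', 'meow', 'dogs']
```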

A specific list of allowed characters can be indicated with brackets (\[\]). A hyphen can be used to indicate a range (most commonly A-Z). A caret ("^") immediately after the opening bracket indicates a set of *dis*allowed characters, often very useful for identifying things like matching parentheses, brackets, etc.

In [100]:
#provided code
regex1 = r"[Tt]he"
regex2 = r" [a-z][a-z][a-z] "
regex3 = r"[^aeiou ][^aeiou ][^aeiou ]"
regex4 = r" [^aeiou ][^aeiou ]"

In [37]:
re.search(regex1, S)

<_sre.SRE_Match object; span=(0, 3), match='The'>

In [38]:
re.search(regex2, S)

<_sre.SRE_Match object; span=(15, 20), match=' fox '>

In [39]:
re.search(regex3, S)

In [42]:
re.search(regex4, S)

<_sre.SRE_Match object; span=(9, 12), match=' br'>
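
To make the "matching parentheses" use of a negated set concrete, here is a small sketch. The sample `text` and the name `parenthetical` are ours, and the pattern assumes the parentheses are not nested.

```python
import re

text = "The fox (a quick one) jumped over the dogs (two of them)."

# \( and \) match literal parentheses; [^()]* matches any run of
# characters that are not parentheses, so each match stops at the
# first closing parenthesis.
parenthetical = r"\([^()]*\)"
found = re.findall(parenthetical, text)
print(found)  # ['(a quick one)', '(two of them)']
```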

A character followed by a question mark "?" is optional. "\*" after a character matches any number of repetitions of it (including zero), and "+" requires at least one. A specific number of repetitions (or a range, e.g. {2,4}) can be indicated by including a number in curly brackets ({}).

In [10]:
#provided code
S1 = '"Yes"'
S2 = '"No"'
S3 = '""'

In [11]:
#provided code
regex1 = r'^"[^"]+"$'

In [12]:
re.search(regex1, S1)

<re.Match object; span=(0, 5), match='"Yes"'>

In [13]:
re.search(regex1, S2)

<re.Match object; span=(0, 4), match='"No"'>

In [14]:
re.search(regex1, S3)

In [15]:
#provided code
regex2 = r'^"[^"]*"$'

In [16]:
re.search(regex2, S1)

<re.Match object; span=(0, 5), match='"Yes"'>

In [17]:
re.search(regex2, S3)

<re.Match object; span=(0, 2), match='""'>

In [18]:
#provided code
regex3 = r'^"[^"]{3}"$'

In [19]:
re.search(regex3, S1)

<re.Match object; span=(0, 5), match='"Yes"'>

In [55]:
re.search(regex3, S2)

In [102]:
#provided code
regex4 = r'^"[^"]{2}.?"$'

In [56]:
re.search(regex4, S1)

<_sre.SRE_Match object; span=(0, 5), match='"Yes"'>

In [57]:
re.search(regex4, S2)

<_sre.SRE_Match object; span=(0, 4), match='"No"'>
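
The examples above use an exact count in curly brackets; a range works the same way. The following sketch uses a hypothetical `regex_range` and an extra test string, `'"Maybe"'`, which are not part of the original cells.

```python
import re

candidates = ['"Yes"', '"No"', '""', '"Maybe"']

# {2,3} means "at least 2 and at most 3 repetitions" of the preceding item.
regex_range = r'^"[^"]{2,3}"$'
results = {s: bool(re.search(regex_range, s)) for s in candidates}
for s, matched in results.items():
    print(s, matched)
```

Only the two- and three-letter answers should be reported as matches; the empty quotes are too short and the five-letter answer is too long.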

Use parentheses to group sequences of "characters" (including other groups). Groups can be combined with ?, \*, and +, and are necessary to make good use of the pipe operator (|), which in regex means *or*. Note that whereas both \[\] and | can be used to list options, \[\] only represents individual characters, and | should be used for longer alternatives.

In [25]:
#provided code
S = "This is a sentence, I think. But I'm not sure, are you? You are!"

In [27]:
regex = r"(^|[.?!,;] )\w+ "

for result in re.finditer(regex,S):
    print(result)

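<re.Match object; span=(0, 5), match='This '>
<re.Match object; span=(18, 22), match=', I '>
<re.Match object; span=(27, 33), match='. But '>
<re.Match object; span=(45, 51), match=', are '>
<re.Match object; span=(54, 60), match='? You '>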


In [65]:
# provided code
S = "I have two front teeth but one tooth is loose."

In [None]:
regex2 = r"t(ee|oo)th"
for result in re.finditer(regex2,S):
    print(result)

Note that special regex characters need to be escaped when they appear outside of brackets.

In [28]:
S1 = "345.955"
S2 = "96504"

# my code here

regex = r"\d+\.\d+"

# my code here

print(re.match(regex,S1))
print(re.match(regex,S2))

<re.Match object; span=(0, 7), match='345.955'>
None
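
A short sketch of the difference between an unescaped ".", an escaped "\." and a "." inside brackets. The sample string and variable names are ours.

```python
import re

S = "Wait. Really? Yes!"

# Unescaped, "." matches any character (here, every character of S).
print(len(re.findall(r".", S)) == len(S))  # True

# Escaped, it matches only a literal period.
periods = re.findall(r"\.", S)
print(periods)  # ['.']

# Inside brackets, ".", "?" and "!" are already literal.
punctuation = re.findall(r"[.?!]", S)
print(punctuation)  # ['.', '?', '!']
```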


## Representing language with regex

Many basic linguistic phenomena can be represented with regular expressions. Let's first write a regex that could match possible inflections of a regular English verb.

In [81]:
forms = ["Jump","jumps","jumped","jumping"]

# my code here
regex = r"[jJ]ump(s|ed|ing)?"
# my code here

for form in forms:
    print(re.match(regex,form))

<_sre.SRE_Match object; span=(0, 4), match='Jump'>
<_sre.SRE_Match object; span=(0, 5), match='jumps'>
<_sre.SRE_Match object; span=(0, 6), match='jumped'>
<_sre.SRE_Match object; span=(0, 7), match='jumping'>


Exercise: Create a regex like the above, but one that can handle all forms of the verb "catch" and not invalid forms.

In [80]:
forms = ["catch", "catches", "caught", "catching"]
invalid = ["catched","catchs"]

# your code here
regex = r"(catch(es|ing)?$|caught)"
# your code here
for form in forms:
    print(re.match(regex,form))
    
for form in invalid:
     print(re.match(regex,form))   

<_sre.SRE_Match object; span=(0, 5), match='catch'>
<_sre.SRE_Match object; span=(0, 7), match='catches'>
<_sre.SRE_Match object; span=(0, 6), match='caught'>
<_sre.SRE_Match object; span=(0, 8), match='catching'>
None
None
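
In the regex above only the first alternative ends in "$", so `re.match` would still accept a string such as "caughts". One possible alternative, sketched here with an extra invalid form of our own, is `re.fullmatch`, which requires the entire string to match.

```python
import re

forms = ["catch", "catches", "caught", "catching"]
invalid = ["catched", "catchs", "caughts"]

# re.fullmatch succeeds only if the entire string matches,
# so no explicit $ is needed on either alternative.
regex = r"catch(es|ing)?|caught"
valid_results = [bool(re.fullmatch(regex, form)) for form in forms]
invalid_results = [bool(re.fullmatch(regex, form)) for form in invalid]
print(valid_results)    # [True, True, True, True]
print(invalid_results)  # [False, False, False]
```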


Relatively fixed phrases can often be represented fairly accurately with regular expressions. What if we wanted to find forms of the phrase *blow someone's mind*?

In [14]:
test_cases = ["blow your mind", "blew my mind","blowing John's mind"]

# my code here
regex = r"bl[oe]w(s|ing)? (his|my|her|your|our|their|[A-Z]\w+'s) mind"
# my code here

for test_case in test_cases:
    print(re.match(regex,test_case))

<re.Match object; span=(0, 14), match='blow your mind'>
<re.Match object; span=(0, 12), match='blew my mind'>
<re.Match object; span=(0, 19), match="blowing John's mind">


We can also use regular expressions to model very simple languages. Let's create a regex that will match all sentences that can be created by the following verbs and nouns.

In [3]:
nouns = ["people", "dogs", "cats", "mice", "computers"]
verbs = ["like", "eat", "obey"]
test_cases = ["Cats eat mice.", "People like computers.","Dogs obey people."]

#my code here
regex = r"^(" + "|".join([noun.title() for noun in nouns]) + ") (" + "|".join(verbs) + ") (" + "|".join(nouns) + r")\.$"
print(regex)
#my code here

for test_case in test_cases:
    print(re.match(regex,test_case))


^(People|Dogs|Cats|Mice|Computers) (like|eat|obey) (people|dogs|cats|mice|computers)\.$
<_sre.SRE_Match object; span=(0, 14), match='Cats eat mice.'>
<_sre.SRE_Match object; span=(0, 22), match='People like computers.'>
<_sre.SRE_Match object; span=(0, 17), match='Dogs obey people.'>


Larger chunks of language get increasingly challenging to represent well using regexes without making assumptions that either miss cases or overgeneralize. Let's try to represent the basic structure of a list (3 or more items) in (written) English.

In [29]:
simple_list = "red, green, and blue"
# my code here
regex = r"(\w+, )+and \w+"
# my code here
re.match(regex,simple_list)

<re.Match object; span=(0, 20), match='red, green, and blue'>

Exercise: Come up with a case of a list that isn't covered by what we just did (and show that it isn't). Modify the regular expression so that it is covered. Now come up with a case that isn't a list but that would be matched by the regex you just wrote.

In [30]:
other_list = "red, green and blue"
print(re.match(regex,other_list))

None


In [104]:
new_regex = r"(\w+,? )+and \w+"
print(re.match(new_regex,other_list))

<_sre.SRE_Match object; span=(0, 19), match='red, green and blue'>


In [105]:
not_a_list = "not fun and games"
print(re.match(new_regex,not_a_list))

<_sre.SRE_Match object; span=(0, 17), match='not fun and games'>


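Fun fact: There are languages which provably cannot be represented by a regular expression. Whether English is one of them depends on your assumptions: any set of sentences with bounded length and nesting depth can in principle be captured by a regex, though the classic argument from unbounded center-embedding suggests that idealized English cannot. Either way, representing English (or any human language) by regex would be very, very painful. I do not recommend it. There are much better representations which we will talk about later in the program.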

## Match objects

If you just want to see if a string starts with one of the strings in the set your regex defines, use the [match](https://docs.python.org/3/library/re.html#re.match) function (or [fullmatch](https://docs.python.org/3/library/re.html#re.fullmatch) if the entire string has to match). However, if you want to find a (single) match anywhere within a larger span of text, use [search](https://docs.python.org/3/library/re.html#re.search). Use [finditer](https://docs.python.org/3/library/re.html#re.finditer) if you want multiple matches.

In [44]:
#provided code
regex = r"[Jj]ump(s|ed|ing)?"

S1 = "jumps"
S2 = "Stop jumping on the bed"
S3 = "Jumped off the bus and jumped on the subway"

In [32]:
print(re.match(regex,S1))

<re.Match object; span=(0, 5), match='jumps'>


In [None]:
print(re.match(regex,S2))

In [None]:
print(re.search(regex,S2))

In [45]:
for match in re.finditer(regex,S3):
    print(match)

<re.Match object; span=(0, 6), match='Jumped'>
<re.Match object; span=(23, 29), match='jumped'>
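
One detail worth seeing in code: `re.match` anchors only at the start of the string, so it also accepts longer strings that merely begin with a match. The sketch below contrasts it with `re.fullmatch`; the test word "jumpsuit" is our own example.

```python
import re

regex = r"[Jj]ump(s|ed|ing)?"

# re.match only anchors at the start, so a longer word still matches.
prefix_match = re.match(regex, "jumpsuit")
print(prefix_match.group())  # jumps

# re.fullmatch requires the whole string to match.
print(re.fullmatch(regex, "jumpsuit"))  # None
print(re.fullmatch(regex, "jumps").group())  # jumps
```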


The match and search functions return [match objects](https://docs.python.org/3/library/re.html#match-objects) (or None when nothing matches), and finditer yields one match object per match. Match objects evaluate to True (if you only need to check whether a search was successful) but also allow you to access the matched text, using [group](https://docs.python.org/3/library/re.html#re.Match.group), and its position in the string, via the [start](https://docs.python.org/3/library/re.html#re.Match.start) and [end](https://docs.python.org/3/library/re.html#re.Match.end) methods.


In [46]:
#provided code
match = re.search(regex,S2)

In [47]:
if match:
    print("A match!")

A match!


In [48]:
print(match.group())

jumping


In [49]:
S2[:match.start()]

'Stop '

In [50]:
S2[match.end():]

' on the bed'
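
As a small usage sketch, the start and end positions can be combined to rebuild the string around the match. The `span` method returns both positions at once; the name `highlighted` is ours.

```python
import re

regex = r"[Jj]ump(s|ed|ing)?"
S2 = "Stop jumping on the bed"
match = re.search(regex, S2)

# span() returns (start, end) in one call.
start, end = match.span()
print(start, end)  # 5 12

# Rebuild the string with the matched text in upper case.
highlighted = S2[:start] + match.group().upper() + S2[end:]
print(highlighted)  # Stop JUMPING on the bed
```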

In addition to the whole match, including parentheses in your regex also allows you to access particular parts of the match, also using the group method. The first set of parentheses will correspond to group(1), etc.

In [87]:
# provided code
S = "Stop jumping on the bed"
regex = r"([Jj]ump)(s|ed|ing)?"
match = re.search(regex,S)


In [88]:
match.group(0)

'jumping'

In [89]:
match.group(1)

'jump'

In [90]:
match.group(2)

'ing'
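
Two related details, sketched with the same regex: `groups()` returns all the parenthesized groups at once, and an optional group that did not take part in the match comes back as None. The second test sentence is our own.

```python
import re

regex = r"([Jj]ump)(s|ed|ing)?"

# groups() returns all parenthesized groups as a tuple.
inflected = re.search(regex, "Stop jumping on the bed")
print(inflected.groups())  # ('jump', 'ing')

# An optional group that matched nothing is reported as None.
bare = re.search(regex, "Jump!")
print(bare.groups())  # ('Jump', None)
print(bare.group(2) is None)  # True
```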

## Advanced regular expression techniques

Another useful method is [sub](https://docs.python.org/3/library/re.html#re.sub), which creates a new version of the input string with substitutions made. This is similar to the string method `replace`, but it is a lot more flexible (though if you are just replacing a fixed string with a fixed string, much better to use replace!). You can specify strings pulled out from the match within your replacement string using \1, \2, etc. where the number corresponds to the group number.

In [57]:
S = "This is August 4th, 2019 (4/8/2019)"

regex = r"(\d\d?)/(\d\d?)/(\d\d\d?\d?)"
match = re.search(regex,S)
print(match.group(3))

# my code here
print(re.sub(regex, r"\2/\1/\3", S))
# my code here

2019
This is August 4th, 2019 (8/4/2019)
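
Part of the extra flexibility of `sub` is that the replacement can be a function that receives each match object and returns the replacement text. The sketch below uses a hypothetical helper, `swap_and_pad`, to swap the first two fields and zero-pad them.

```python
import re

S = "This is August 4th, 2019 (4/8/2019)"
regex = r"(\d\d?)/(\d\d?)/(\d\d\d?\d?)"

def swap_and_pad(match):
    """Swap the first two fields and zero-pad them to two digits."""
    first, second, year = match.groups()
    return f"{int(second):02d}/{int(first):02d}/{year}"

reformatted = re.sub(regex, swap_and_pad, S)
print(reformatted)  # This is August 4th, 2019 (08/04/2019)
```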


If you are going to use a particular regular expression multiple times, it is a good idea to [compile](https://docs.python.org/3/library/re.html#re.compile) it once, and just use the methods of the compiled regular expression objects, which are equivalent to the re module functions except there is no need to specify the regex.


In [96]:
#provided code
texts = ["jumps", "Stop jumping on the bed", "Jumped off the bus and jumped on the subway"]
regex = r"[Jj]ump(s|ed|ing)?"

compiled = re.compile(regex)
for text in texts:
    print(compiled.search(text))


<_sre.SRE_Match object; span=(0, 5), match='jumps'>
<_sre.SRE_Match object; span=(5, 12), match='jumping'>
<_sre.SRE_Match object; span=(0, 6), match='Jumped'>


For efficiency or clarity, you may want to group elements in your regex without causing them to become a group in the resulting match object. Use (?:...) instead of (...).

In [58]:
regex = r"[Jj]ump(?:s|ed|ing)?"
S = "jumps"
# my code here
match = re.match(regex,S)
print(match)
match.group(1)
# my code here

<re.Match object; span=(0, 5), match='jumps'>


IndexError: no such group
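
One place where the distinction matters in practice is `re.findall`, which returns the captured group rather than the whole match when the regex contains a capturing group. A minimal sketch, with a sample sentence of our own:

```python
import re

S = "He jumps, she jumped, they are jumping"

# With a capturing group, findall returns what the group captured.
captured = re.findall(r"[Jj]ump(s|ed|ing)?", S)
print(captured)  # ['s', 'ed', 'ing']

# With a non-capturing group, findall returns the whole matches.
whole = re.findall(r"[Jj]ump(?:s|ed|ing)?", S)
print(whole)  # ['jumps', 'jumped', 'jumping']
```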

Python also has syntax that lets you check whether a particular string appears directly after or before the string you are looking for, without including it in the match; these are known as positive lookahead (?=...) and lookbehind (?<=...) assertions. A lookbehind must match strings of a *fixed size*. The same functionality can generally be accomplished by matching the larger string and then getting the group you want.

In [81]:
#provided code
S = "Why do you make that expression when I mention regular expressions?"
regex = r"(?<=[Rr]egular )[Ee]xpressions?"

for match in re.finditer(regex,S):
    print(match)
S[:match.start()]


<re.Match object; span=(55, 66), match='expressions'>


'Why do you make that expression when I mention regular '
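
The cell above shows only the lookbehind form. As a sketch of the other two ideas in this paragraph, the code below uses a positive lookahead, then redoes the lookbehind example with a capturing group instead. The variable names are ours.

```python
import re

S = "Why do you make that expression when I mention regular expressions?"

# Positive lookahead: a word directly followed by " expressions".
# The lookahead text is checked but not included in the match.
before = re.findall(r"\w+(?= expressions)", S)
print(before)  # ['regular']

# The lookbehind example above, redone with a capturing group:
# match the larger string, then pull out group 1.
match = re.search(r"[Rr]egular ([Ee]xpressions?)", S)
print(match.group(1), match.span(1))  # expressions (55, 66)
```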

More useful are the negative lookahead (?!...) and lookbehind (?<!...) assertions, which allow you to match only when a string doesn't appear. For instance, you could pull out instances of a word except when it appears in a context you aren't interested in.

In [85]:
#provided code
regex = r"(?<![Rr]egular )[Ee]xpressions?"
for match in re.finditer(regex,S):
    print(match)


<re.Match object; span=(21, 31), match='expression'>
