# LING 242 Python Lecture 5: Regular expressions

* Escaping
* A review of basic regular expression syntax
* Representing language with regex
* Match objects
* Other regex methods
* Some expert regex syntax
* Exercises

## Escaping

Remember that when declaring a string literal, you need to _escape_ a character when it would be misinterpreted as having another meaning in the context. For example, when using "" to indicate a string, any " inside the string would need to be escaped, by putting the escape character \ in front of it. Let's try writing a string literal for *"I can't," he said.*

In [1]:
S1 = "\"I can't,\" he said."
S2 = '"I can\'t," he said.'

Note that the output a notebook displays for a string is its string literal form, and is therefore escaped.

In [2]:
print(S2)
S2

"I can't," he said.


'"I can\'t," he said.'

In [3]:
print(S1)
S1

"I can't," he said.


'"I can\'t," he said.'

Though escaping quote characters can usually be avoided in Python, sooner or later you will need to escape the escape character itself, for example in an (MS-DOS) filename.

In [4]:
path1 = "c:\files\videos"
path2 = "c:\\files\\videos"

In [5]:
print(path1)

c:ilesideos


In [6]:
print(path2)

c:\files\videos
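
To see why the first path printed so strangely, it can help to look at the `repr` of each string. This is only a small sketch using the same two paths as above; it relies on the fact that `\f` (form feed) and `\v` (vertical tab) are valid Python escape sequences.

```python
path1 = "c:\files\videos"    # \f and \v are read as escape sequences
path2 = "c:\\files\\videos"  # each \\ is one literal backslash

# repr shows the control characters that print leaves invisible.
print(repr(path1))

# path1 is two characters shorter: each escape sequence became one character.
print(len(path1), len(path2))
```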


There are several characters which can only be expressed using an escape sequence. The most useful of these by far are of course "\n" (newline) and "\t" (tab). Here is a [full list](https://docs.python.org/3/reference/lexical_analysis.html#literals), but you're not likely to come across the rest anymore (many of them are related to original typewriter commands).

In [7]:
S1 = "Hi.\nHow are you?"
S2 = "the\t2483"

In [8]:
print(S1)

Hi.
How are you?


In [9]:
print(S2)

the	2483


There are two ways to cut down on escaping. One is to use the triple quote syntax, which as you know is also used for commenting; it lets a string literal span several lines, so you don't need to write \n.

In [10]:
S = '''Hi.
How are you.'''
S

'Hi.\nHow are you.'

And the other is to use r (for raw) in front of the string literal.

In [11]:
import re

filename = r"C:\notes.txt"

print(filename)

C:\notes.txt


You will want to use r when writing regular expressions that involve escape characters so you don't have to escape everything twice, since regexes also use the backslash as their escape character. To illustrate, let's write a regex that matches the filename above without using r.

In [12]:
re.match("C:\\\\notes\\.txt", filename)

<re.Match object; span=(0, 12), match='C:\\notes.txt'>
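
For comparison, here is a sketch that puts the raw and non-raw spellings of the same pattern side by side. The variable names `escaped_twice` and `raw_pattern` are made up for this illustration; `re.escape` is a standard library helper that builds a pattern matching a string literally.

```python
import re

filename = r"C:\notes.txt"

# Without r: every backslash meant for the regex engine must also be
# escaped for Python, so one literal backslash costs four characters.
escaped_twice = "C:\\\\notes\\.txt"

# With r: the pattern is written exactly as the regex engine sees it.
raw_pattern = r"C:\\notes\.txt"

# Both literals produce the same pattern string.
print(escaped_twice == raw_pattern)

# The raw pattern matches the filename.
print(re.match(raw_pattern, filename))

# re.escape builds an equivalent literal-matching pattern for us.
print(re.match(re.escape(filename), filename))
```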

## A review of basic regular expression syntax

[Regular expressions](https://docs.python.org/3/library/re.html) (regexes) ultimately represent sets of strings. Their syntax can be tricky, so it's worth noting that there are websites (e.g. [this one](https://regex101.com/)) that can help you write correct and accurate regexes.

When you don't use any special regex syntax, a regular expression represents only the string you give it. Note that if you're doing this, there's a better way using string methods, and for efficiency (regex comes with an overhead) you should use those instead. 

In [13]:
regex = "quick"
S = "The quick brown fox jumped over the 2 lazy dogs"
re.search(regex, S)

<re.Match object; span=(4, 9), match='quick'>

There are various ways to indicate groups of possible characters; look just above [here](https://docs.python.org/3/library/re.html#module-contents) in the Python docs for a list. Most useful: "." indicates any character (except a newline, by default), "\w" a word character (a letter, digit, or underscore), and "\d" a digit. The caret "^" and "$" don't indicate characters, but rather the beginning and end of a string (or, with the `re.MULTILINE` flag, of a line).

In [14]:
regex1 = r"qu.\wk"
regex2 = r"\d"
regex3 = r"^dogs"
regex4 = r"dogs$"

In [15]:
re.search(regex1,S)

<re.Match object; span=(4, 9), match='quick'>

In [16]:
re.search(regex2,S)

<re.Match object; span=(36, 37), match='2'>

In [17]:
print(re.search(regex3,S))

None


In [18]:
re.search(regex4,S)

<re.Match object; span=(43, 47), match='dogs'>
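
Two details of these character classes and anchors are easy to check directly: what `\w` covers, and when `^` matches at the start of a line rather than the start of the string. The strings below are invented for this sketch.

```python
import re

# \w matches letters, digits and the underscore, not only letters.
word_chars = [m.group() for m in re.finditer(r"\w", "a_1 !")]
print(word_chars)

text = "cats\ndogs"

# By default, ^ matches only at the very start of the string.
default_match = re.search(r"^dogs", text)
print(default_match)

# With re.MULTILINE, ^ and $ also match at the start and end of each line.
multiline_match = re.search(r"^dogs", text, re.MULTILINE)
print(multiline_match)
```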

A specific list of allowed characters can be indicated with brackets (\[\]). A hyphen can be used to indicate a range (most commonly A-Z or a-z). A caret ("^") immediately after the opening bracket indicates a set of *dis*allowed characters, often very useful for matching long strings between two delimiters (e.g. brackets).

In [19]:
S = "The quick brown fox jumped over the 2 lazy dogs"

regex1 = r"[Tt]he"
regex2 = r" [a-z][a-z][a-z] "
regex3 = r"[^aeiou ][^aeiou ][^aeiou ]"
regex4 = r" [^aeiou ][^aeiou ]"

In [20]:
re.search(regex1, S)

<re.Match object; span=(0, 3), match='The'>

In [21]:
re.search(regex2, S)

<re.Match object; span=(15, 20), match=' fox '>

In [22]:
re.search(regex3, S)

In [23]:
re.search(regex4, S)

<re.Match object; span=(9, 12), match=' br'>
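
As a sketch of the "between two delimiters" use of a negated set, the following pulls out bracketed spans from a made-up sentence. It uses `+` (one or more), and the example string is an assumption of this illustration.

```python
import re

S = "The results [see Table 2] were significant [p < 0.05]."

# \[ and \] are literal brackets; [^\]]+ is one or more characters that
# are not a closing bracket, so each match stops at the first "]".
regex = r"\[[^\]]+\]"

bracketed = [match.group() for match in re.finditer(regex, S)]
print(bracketed)
```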

A character followed by a question mark "?" indicates optionality. You can represent any number of repetitions of a character (including zero) with "\*", or use "+" if you want there to be at least one. A specific number of instances (or a range, given as a comma-separated pair of bounds) can be indicated in curly brackets ({}).

In [24]:
S1 = '"Yes"'
S2 = '"No"'
S3 = '""'

In [25]:
regex1 = r'^"[^"]+"$'

In [26]:
re.search(regex1, S1)

<re.Match object; span=(0, 5), match='"Yes"'>

In [27]:
re.search(regex1, S2)

<re.Match object; span=(0, 4), match='"No"'>

In [28]:
re.search(regex1, S3)

In [29]:
S1 = '"Yes"'
S3 = '""'

regex2 = r'^"[^"]*"$'

In [30]:
re.search(regex2, S1)

<re.Match object; span=(0, 5), match='"Yes"'>

In [31]:
re.search(regex2, S3)

<re.Match object; span=(0, 2), match='""'>

In [32]:
S1 = '"Yes"'
S2 = '"No"'
S3 = '""'

regex3 = r'^"[^"]{,3}"$'
print(re.search(regex3, S1))
print(re.search(regex3, S2))
print(re.search(regex3, S3))

<re.Match object; span=(0, 5), match='"Yes"'>
<re.Match object; span=(0, 4), match='"No"'>
<re.Match object; span=(0, 2), match='""'>


In [33]:
regex3 = r'^"[^"]{,3}"$'
print(re.search(regex3, S1))
print(re.search(regex3, S2))

<re.Match object; span=(0, 5), match='"Yes"'>
<re.Match object; span=(0, 4), match='"No"'>


In [34]:
re.search(regex3, S2)

<re.Match object; span=(0, 4), match='"No"'>

In [35]:
S1 = '"Yes"'
S2 = '"No"'

regex4 = r'^"[^"]{2}.?"$'

In [36]:
re.search(regex4, S1)

<re.Match object; span=(0, 5), match='"Yes"'>

In [37]:
re.search(regex4, S2)

<re.Match object; span=(0, 4), match='"No"'>
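
The quantifiers can be compared side by side on a toy example. This is just a sketch: the candidate strings and the `patterns` dictionary are made up, and it uses `re.fullmatch`, which succeeds only when the pattern covers the entire string.

```python
import re

# Hypothetical test strings: runs of the letter "a" of increasing length.
candidates = ["", "a", "aa", "aaa", "aaaa"]

patterns = {
    "a?": "zero or one",
    "a*": "zero or more",
    "a+": "one or more",
    "a{2}": "exactly two",
    "a{2,3}": "two to three",
    "a{2,}": "two or more",
}

# For each pattern, collect the candidates it accepts in full.
results = {
    pattern: [c for c in candidates if re.fullmatch(pattern, c)]
    for pattern in patterns
}

for pattern, meaning in patterns.items():
    print(pattern, "(" + meaning + "):", results[pattern])
```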

Use parentheses to group sequences of "characters" (including other groups). Groups can be combined with ?, \*, and +, and are necessary to make good use of the pipe operator (|), which in regex means *or*. Note that whereas both \[\] and | can be used to list options, \[\] only represents individual characters, and | should be used for longer alternatives.

In [38]:
S = "I have two front teeth but one tooth is loose."

In [39]:
regex2 = r"t(ee|oo)th"
for result in re.finditer(regex2,S):
    print(result)

<re.Match object; span=(17, 22), match='teeth'>
<re.Match object; span=(31, 36), match='tooth'>


In [40]:
S = "This is a sentence, I think. But I'm not sure, are you? You are!"

In [41]:
regex = r"(^|[.?!] )\w+ "

for result in re.finditer(regex,S):
    print(result)

<re.Match object; span=(0, 5), match='This '>
<re.Match object; span=(27, 33), match='. But '>
<re.Match object; span=(54, 60), match='? You '>
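
To see why the parentheses matter for the pipe operator, here is a sketch comparing three patterns on the teeth/tooth sentence from above. The variable names are invented for the illustration.

```python
import re

S = "I have two front teeth but one tooth is loose."

# Without parentheses, | splits the whole pattern: "tee" OR "ooth".
ungrouped = r"tee|ooth"
# With parentheses, only the vowels alternate: "teeth" OR "tooth".
grouped = r"t(ee|oo)th"
# Brackets choose one character at a time, so this also allows "teoth".
bracketed = r"t[eo][eo]th"

found = {}
for regex in (ungrouped, grouped, bracketed):
    found[regex] = [m.group() for m in re.finditer(regex, S)]
    print(regex, found[regex])
```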


Note that special regex characters need to be escaped if you want to use them literally outside of brackets.

In [42]:
S1 = "345.955"
S2 = "96504"

regex = r"\d+\.\d+"

print(re.match(regex,S1))
print(re.match(regex,S2))

<re.Match object; span=(0, 7), match='345.955'>
None
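
A quick sketch of what goes wrong when the period is left unescaped, and of the bracket alternative. The test strings are made up, and `re.fullmatch` is used so that the whole string has to fit the pattern.

```python
import re

# Unescaped, "." matches any character, so "3x5" is accepted.
unescaped = re.fullmatch(r"\d.\d", "3x5")
print(unescaped)

# Escaped, "." matches only a literal period, so "3x5" is rejected.
escaped = re.fullmatch(r"\d\.\d", "3x5")
print(escaped)

# Inside brackets, "." is already literal and needs no backslash.
in_brackets = re.fullmatch(r"\d[.]\d", "3.5")
print(in_brackets)
```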


## Representing language with regex

Many basic linguistic phenomena can be represented with regular expressions. Let's first write a regex that could match possible inflections of a regular English verb, including capitalization variation.

In [43]:
forms = ["Jump","jumps","jumped","jumping"]

regex = r"[jJ]ump(s|ed|ing)?"

for form in forms:
    print(re.match(regex,form))

<re.Match object; span=(0, 4), match='Jump'>
<re.Match object; span=(0, 5), match='jumps'>
<re.Match object; span=(0, 6), match='jumped'>
<re.Match object; span=(0, 7), match='jumping'>


Let's create a regex like the above, but one that can handle all forms of the verb "catch" and not produce invalid forms:

In [44]:
forms = ["catch", "catches", "caught", "catching"]
invalid = ["catched","catchs","catchesed","caughting"]

regex = r"ca(tch(es|ing)?|ught)$"

print("good forms")
for form in forms:
    print(re.match(regex,form))
print("bad forms")
for form in invalid:
    print(re.match(regex,form))

good forms
<re.Match object; span=(0, 5), match='catch'>
<re.Match object; span=(0, 7), match='catches'>
<re.Match object; span=(0, 6), match='caught'>
<re.Match object; span=(0, 8), match='catching'>
bad forms
None
None
None
None


Relatively fixed phrases can often be identified fairly accurately with regular expressions. What if we wanted to find various forms of the phrase *blow someone's mind*? (limited to one word between _blow_ and _mind_)

In [45]:
test_cases = ["blow your mind", "blew my mind", "blowing John's mind"]

regex = r"bl[oe]w(s|ing)? (his|my|her|your|our|their|[A-Z]\w+'s) mind"

for test_case in test_cases:
    print(re.match(regex,test_case))

<re.Match object; span=(0, 14), match='blow your mind'>
<re.Match object; span=(0, 12), match='blew my mind'>
<re.Match object; span=(0, 19), match="blowing John's mind">


We can also use regular expressions to model very simple languages. Here is a regex that will match all sentences that can be created from the following verbs and nouns. Note that we are building our regex programmatically, which might be the right choice if your expression involves a lexicon.

In [46]:
nouns = ["people", "dogs", "cats", "mice", "computers"]
verbs = ["like", "eat", "obey"]

regex = "^(" + "|".join([noun.title() for noun in nouns]) + ") (" + "|".join(verbs) + ") (" + "|".join(nouns) + r")\.$"

test_cases = ["Cats eat mice.", "People like computers.", "Dogs obey people."]

for test_case in test_cases:
    print(re.match(regex,test_case))

<re.Match object; span=(0, 14), match='Cats eat mice.'>
<re.Match object; span=(0, 22), match='People like computers.'>
<re.Match object; span=(0, 17), match='Dogs obey people.'>
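
When a lexicon may contain characters that are special in regex, one option is to pass each entry through `re.escape` before joining. The sketch below assumes a made-up lexicon of abbreviations; it is not part of the example above.

```python
import re

# Hypothetical lexicon; "e.g." and "U.S." contain the special character ".".
lexicon = ["e.g.", "U.S.", "etc"]

# Joined as-is, each "." would match any character.
naive = r"^(" + "|".join(lexicon) + r")$"

# re.escape makes each entry match literally before joining with |.
escaped = r"^(" + "|".join(re.escape(word) for word in lexicon) + r")$"

print(escaped)
for test_case in ["e.g.", "etc", "eggs"]:
    print(test_case, bool(re.match(naive, test_case)), bool(re.match(escaped, test_case)))
```

With the naive pattern "eggs" is accepted, because each unescaped period stands for any character; the escaped pattern rejects it.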


Larger chunks of language get increasingly challenging to represent well using regexes without making assumptions that either miss cases or overgeneralize.

Fun fact: There are languages which provably cannot be represented by a regular expression. However, English is not one of them! That said, representing English (or any human language) by regex would be very, very painful. I do not recommend it. There are much better representations!

## Match objects

If you just want to see whether a string is in the set of strings your regex defines, use the [match](https://docs.python.org/3/library/re.html#re.match) method. Note that match only anchors the regex at the start of the string, so end the regex with "$" (or use `re.fullmatch`) if the whole string must match. However, if you want to find a (single) match within a larger span of text, use [search](https://docs.python.org/3/library/re.html#re.search). Use [finditer](https://docs.python.org/3/library/re.html#re.finditer) if you want multiple matches.

In [47]:
regex = r"[Jj]ump(s|ed|ing)?"

S1 = "jumps"
S2 = "Stop jumping on the bed"
S3 = "Jumped off the bus and jumped on the subway"
S4 = "jump"

In [48]:
regex = r"[Jj]ump(s|ed|ing)?"
print(re.match(regex,S4))

regex4 = r"[Jj]ump(s|ed|ing)"
print(re.match(regex4,S4))

<re.Match object; span=(0, 4), match='jump'>
None


In [49]:
print(re.match(regex,S1))

<re.Match object; span=(0, 5), match='jumps'>


In [50]:
print(re.match(regex,S2))

None


In [51]:
print(re.search(regex,S2))

<re.Match object; span=(5, 12), match='jumping'>


In [52]:
for match in re.finditer(regex,S3):
    print(match)

<re.Match object; span=(0, 6), match='Jumped'>
<re.Match object; span=(23, 29), match='jumped'>
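
One subtlety worth checking for yourself is that `re.match` accepts a string as soon as a prefix of it fits the regex. The following sketch uses a made-up test word, "jumpsuit", and the standard `re.fullmatch` function for comparison.

```python
import re

regex = r"[Jj]ump(s|ed|ing)?"
S = "jumpsuit"

# match only anchors at the start, so the prefix "jumps" is enough.
prefix_match = re.match(regex, S)
print(prefix_match)

# fullmatch requires the whole string to fit the regex.
whole_match = re.fullmatch(regex, S)
print(whole_match)

# Adding $ to the pattern gives match the same behaviour here.
anchored_match = re.match(regex + "$", S)
print(anchored_match)
```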


All these functions return [match objects](https://docs.python.org/3/library/re.html#match-objects) (or None when there is no match), which evaluate to True (useful if you only need to check whether a search was successful) but also allow you to access the matched text, using [group](https://docs.python.org/3/library/re.html#re.Match.group), and its position in the string, via the [start](https://docs.python.org/3/library/re.html#re.Match.start) and [end](https://docs.python.org/3/library/re.html#re.Match.end) methods.


In [53]:
match = re.search(regex,S2)

In [54]:
if match:
    print("A match!")

A match!


In [55]:
print(match.group())

jumping


In [56]:
S2[:match.start()]

'Stop '

In [57]:
S2[match.end():]

' on the bed'

In addition to the whole match, including parentheses in your regex also allows you to access particular parts of the match, also using the group method. The first set of parentheses will correspond to group(1), etc.

In [58]:
S = "Stop jumping on the bed"
regex = r"([Jj]ump)(s|ed|ing)?"
match = re.search(regex,S)

In [59]:
match.group(0)

'jumping'

In [60]:
match.group(1)

'jump'

In [61]:
match.group(2)

'ing'
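
It may also be worth seeing what happens when an optional group does not take part in the match. This sketch reuses the regex above on an invented sentence and uses the match object's `groups` method, which returns all numbered groups as a tuple.

```python
import re

regex = r"([Jj]ump)(s|ed|ing)?"
match = re.search(regex, "They jump rope")

# All numbered groups at once; an unused optional group shows up as None.
print(match.groups())
print(match.group(2))

# start and end also accept a group number.
print(match.start(1), match.end(1))
```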

Note there is also a [findall](https://docs.python.org/3/library/re.html#re.findall) method which can be used to get a list of all strings that match, circumventing match objects entirely. Note, however, that it does perhaps unexpected things when you have regexes with (grouping) parentheses (it returns only the group contents, as a list of tuples when there are several groups), which will be a fairly common scenario unless you use non-capturing parentheses (see the optional section below).

Also, `finditer` is nice in that it is actually a generator, processing each match one by one in the context of a for loop. For this course, I strongly recommend you use `finditer` in cases where you are looking for multiple matches (you may lose quality points if you do use `findall` and it results in confusing code)!
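
The difference is easiest to see on a regex with two groups. A minimal sketch, reusing a sentence from earlier in this lecture:

```python
import re

S = "Jumped off the bus and jumped on the subway"
regex = r"([Jj]ump)(s|ed|ing)?"

# With two groups, findall returns tuples of group contents,
# not the whole matched strings.
as_tuples = re.findall(regex, S)
print(as_tuples)

# finditer yields match objects, so the whole match is always available.
whole_matches = [match.group(0) for match in re.finditer(regex, S)]
print(whole_matches)
```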

## Other regex methods

If you want to split a string but the delimiter does not correspond to a fixed phrase, you can use the regex version of [split](https://docs.python.org/3/library/re.html#re.split).

In [62]:
S = "each\nword\thas,a;different:delimiter"
re.split("[\n\t,;:]", S)

['each', 'word', 'has', 'a', 'different', 'delimiter']
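
The delimiter can also vary in length. The sketch below, on a made-up string, splits on a comma or semicolon with any surrounding spaces, and shows the `maxsplit` argument of `re.split`.

```python
import re

S = "apples,  oranges;pears ,bananas"

# Optional spaces, then one comma or semicolon, then optional spaces.
delimiter = r" *[,;] *"

all_parts = re.split(delimiter, S)
print(all_parts)

# maxsplit limits how many splits are made.
first_split = re.split(delimiter, S, maxsplit=1)
print(first_split)
```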

Another useful method is [sub](https://docs.python.org/3/library/re.html#re.sub), which creates a new version of the input string with substitutions made. This is similar to the string method `replace`, but it is a lot more flexible (though if you are just replacing a fixed string with a fixed string, much better to use replace!). You can specify strings pulled out from the match within your replacement string using \1, \2, etc. where the number corresponds to the group number.

In [63]:
S = "This is August 4th, 2019 (8/4/2019)"

regex = r"(\d\d?)/(\d\d?)/(\d\d\d?\d?)"
match = re.search(regex,S)
print(match.group(3))

print(re.sub(regex, r"\3年\1月\2日", S))

2019
This is August 4th, 2019 (2019年8月4日)
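
As another sketch of group references in the replacement string, here is a reordering of "Surname, Given" pairs. The names and the pattern are chosen just for illustration; `count` is a standard argument of `re.sub`.

```python
import re

S = "Chomsky, Noam and Labov, William"

# Group 1 is the surname, group 2 the given name.
regex = r"(\w+), (\w+)"

# \2 \1 swaps the two groups in every match.
swapped_all = re.sub(regex, r"\2 \1", S)
print(swapped_all)

# count=1 limits the substitution to the first match.
swapped_first = re.sub(regex, r"\2 \1", S, count=1)
print(swapped_first)
```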


Also, if you are going to use a particular regular expression multiple times, it is a good idea to [compile](https://docs.python.org/3/library/re.html#re.compile) it once and use the methods of the compiled regular expression object, which are equivalent to the `re` module functions except that there is no need to specify the regex. This is tidier and saves the module from looking the pattern up again on every call, which adds up inside loops.


In [64]:
texts = ["jumps", "Stop jumping on the bed", "Jumped off the bus and jumped on the subway"]
regex = "[Jj]ump(s|ed|ing)?"

compiled = re.compile(regex)
for text in texts:
    print(compiled.search(text))


<re.Match object; span=(0, 5), match='jumps'>
<re.Match object; span=(5, 12), match='jumping'>
<re.Match object; span=(0, 6), match='Jumped'>
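
Compiling is also a convenient place to set flags once. The sketch below assumes we want case-insensitive matching, so `re.IGNORECASE` stands in for the explicit `[Jj]`; the test texts are made up.

```python
import re

# re.IGNORECASE replaces the explicit [Jj] at the start of the pattern.
compiled = re.compile(r"jump(s|ed|ing)?", re.IGNORECASE)

texts = ["JUMPING jacks", "Stop jumping on the bed", "no match here"]

found = []
for text in texts:
    # Compiled patterns offer the same methods as the re module functions.
    match = compiled.search(text)
    found.append(match.group() if match else None)

print(found)
```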


## Some expert regex syntax (Optional)

For efficiency or clarity, you may want to group elements in your regex without causing them to become a group in the resulting match object. Use (?:...) instead of (...).

In [65]:
regex = r"[Jj]ump(?:s|ed|ing)?"
S = "jumps"
match = re.match(regex,S)
print(match)
# match.group(1)

<re.Match object; span=(0, 5), match='jumps'>
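
To make the difference visible, this sketch compares the capturing and non-capturing versions of the same regex and shows what happens if you ask for group 1 when no capturing group exists.

```python
import re

S = "jumps"

capturing = re.match(r"[Jj]ump(s|ed|ing)?", S)
non_capturing = re.match(r"[Jj]ump(?:s|ed|ing)?", S)

print(capturing.groups())       # one numbered group
print(non_capturing.groups())   # no numbered groups

# Asking for group(1) on the non-capturing version raises IndexError.
try:
    non_capturing.group(1)
    raised = False
except IndexError as error:
    raised = True
    print("IndexError:", error)
```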


Python also has syntax that lets you check whether a string appears directly after or before the string you are looking for, without including it in the match; these are known as positive lookahead (?=...) and lookbehind (?<=...) assertions. In Python, the pattern inside a lookbehind must match strings of a *fixed size*. The same functionality can generally be accomplished by matching the larger string and then getting the group you want.

In [66]:
S = "Why do you make that expression when I mention regular expressions?"
regex = r"(?<=[Rr]egular )[Ee]xpressions?"

for match in re.finditer(regex,S):
    print(match)

<re.Match object; span=(55, 66), match='expressions'>


More useful are the negative lookahead (?!...) and lookbehind (?<!...) assertions, which allow you to match only when a string doesn't appear. For instance, you could pull out instances of a word except when it appears in a context you aren't interested in.

In [67]:
regex = r"(?<![Rr]egular )[Ee]xpressions?"
for match in re.finditer(regex,S):
    print(match)

<re.Match object; span=(21, 31), match='expression'>
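
The examples above check the text before the match; the lookahead versions check what follows it. A small sketch on the same sentence, where the condition is whether an "s" comes right after "expression":

```python
import re

S = "Why do you make that expression when I mention regular expressions?"

# Positive lookahead: "expression" only when an "s" follows.
# The "s" is checked but not included in the match.
plural_stems = [m.span() for m in re.finditer(r"[Ee]xpression(?=s)", S)]
print(plural_stems)

# Negative lookahead: "expression" only when no "s" follows.
singulars = [m.span() for m in re.finditer(r"[Ee]xpression(?!s)", S)]
print(singulars)
```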


## Exercises

Write regular expressions which correspond to the following patterns and test them with the provided examples (code for testing at the bottom, same as exercise 1 on the lab)

1. A North American telephone number, with various area code options

In [68]:
positive_examples = ["(604)753-9438","1-800-342-9502","934-5204"]
negative_examples = ["A phone number","234536345","(999)1111-2222","(999)111-22222"]

regex = r"^(1-\d{3}-|\(\d{3}\))?\d{3}-\d{4}$"

2. An equation consisting of sums or differences of integers, neatly spaced (though not necessarily correct)

In [69]:
positive_examples = ["2 + 2 = 4", "1 + 10 + 100 + 1000 - 999 = 112"]
negative_examples = ["1 = 1", "n + 1 = 3", "1+1="]

regex = r"(\d+ [+-] )+\d+ = \d+"

3. An occurrence of the English progressive aspect in a larger text (e.g "be \*ing")

In [70]:
positive_examples = ["I am waiting for help", "Who is knocking", "We are laughing"]
negative_examples = ["the killing", "They are singers"]

regex = r"(am|is|be|are) [a-z]*ing($| )"

4. The beginning of a line of dialogue in a Shakespeare play where dialogue is indented and starts with the name of the character in upper case. You should not match lines with just dialogue (which are indented), nor scene/act headers.

In [71]:
positive_examples = ["  KING HENRY. Send for him, good uncle.", "  CAMBRIDGE. I do confess my fault,"]
negative_examples = ["  may call beasts.","SCENE VII. The French camp near Agincourt"]

regex = r"  [A-Z ]+\."

5. A numbered list with 3 items, one per line

In [72]:
positive_examples = ["1. Conditionals\n2. Functions\n3. Loops", "1. Collect underpants\n2. ?\n3. Profit"]
negative_examples = ["1.2.3.","3. Count\n,2. me\n 1. down"]

regex = r"^1\. [^\n]+\n2\. [^\n]+\n3\. .+"

Code below for testing

In [73]:
pattern = re.compile(regex)

for example in positive_examples:
    assert pattern.search(example), example + " not correctly included"

for example in negative_examples:
    assert not pattern.search(example), example + " not correctly excluded"

print("Success!")

Success!
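
If you would rather see every failing example instead of stopping at the first one, the same checks could be wrapped in a function. This is only a sketch; `check_regex` is a made-up helper name, not part of the lab code.

```python
import re

def check_regex(regex, positive_examples, negative_examples):
    """Return a list of failure messages; an empty list means success."""
    pattern = re.compile(regex)
    failures = []
    for example in positive_examples:
        if not pattern.search(example):
            failures.append(example + " not correctly included")
    for example in negative_examples:
        if pattern.search(example):
            failures.append(example + " not correctly excluded")
    return failures

# Exercise 2 from above: no failures are expected.
failures = check_regex(
    r"(\d+ [+-] )+\d+ = \d+",
    ["2 + 2 = 4", "1 + 10 + 100 + 1000 - 999 = 112"],
    ["1 = 1", "n + 1 = 3", "1+1="],
)
print(failures)
```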
