We will be studying regular expressions in this lecture. 

As a motivating example, you may be familiar with searching for text by pressing CTRL-F and typing in the words you’re looking for. Regular expressions go one step further: they allow you to specify a pattern of text to search for. You may not know a business’s exact phone number, but if you live in the United States or Canada, you know it will be three digits, followed by a hyphen, and then four more digits (and optionally, a three-digit area code at the start). This is how you, as a human, know a phone number when you see it: 415-555-1234 is a phone number, but 4,155,551,234 is not.

Regular expressions are helpful, but not many non-programmers know about them even though most modern text editors and word processors, such as Microsoft Word or OpenOffice, have find and find-and-replace features that can search based on regular expressions. Regular expressions are huge time-savers, not just for software users but also for programmers.

In this chapter, we’ll learn more about what a regular expression (a.k.a. regex) is and apply regexes to pattern-matching tasks. This chapter is very important because it lays the foundation for text mining and natural language processing.

To start with, say we want to find a phone number in a string. You know the pattern: three digits, a hyphen, three digits, a hyphen, and four digits. Here’s an example: 415-555-4242. Let’s create a function named isPhoneNumber() that checks whether a string matches this pattern and returns either 'True' or 'False'. We can then use this function to find phone numbers in a longer text string by iterating over it in Python. On each iteration of the 'for' loop below, a new 12-character slice of message is assigned to the variable 'chunk'. The loop goes through the entire string, testing each 12-character piece and printing any 'chunk' that satisfies isPhoneNumber(). Once we’re done going through message, we print 'Done'.

In [1]:
def isPhoneNumber(text):
    if len(text) != 12:
        return False
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False
    if text[3] != '-':
        return False
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
    if text[7] != '-':
        return False
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
    return True
print(isPhoneNumber('415-555-4242')) # True
print(isPhoneNumber('Moshi moshi')) # False
print("")

message = 'Call my cell at 415-555-1011 tomorrow, if you cannot find me, 415-555-9999 is my office.'
for i in range(len(message)):
    chunk = message[i:i+12]
    if isPhoneNumber(chunk):
        print('Phone number found: ' + chunk)
print('Done')

True
False

Phone number found: 415-555-1011
Phone number found: 415-555-9999
Done


The previous phone number–finding program works fine, but it uses a lot of code to do something limited: the function isPhoneNumber() is 17 lines but can find only one pattern of phone numbers. What about a phone number formatted like 415.555.4242 or (415) 555-4242? What if the phone number had an extension, like 415-555-4242 x99? The isPhoneNumber() function would fail to validate them. You could add yet more code for these additional patterns, but there is an easier way.

Regular expressions, called 'regexes' for short, are descriptions of a pattern of text. For example, a \d in a regex stands for a digit character, that is, any single numeral 0 to 9. The regex \d\d\d-\d\d\d-\d\d\d\d is used by Python to match the same text the previous isPhoneNumber() function did: a string of three numbers, a hyphen, three more numbers, another hyphen, and four numbers. Any other string would not match the \d\d\d-\d\d\d-\d\d\d\d regex. Moreover, regular expressions can be much more sophisticated. For example, adding a 3 in curly brackets {3} after a pattern is like saying, "Match this pattern three times." So the slightly shorter regex \d{3}-\d{3}-\d{4} also matches the correct phone number format. In all, the benefit of using regular expressions is enormous: a single short pattern can handle a large amount of text in different forms.

We now need to learn how to create regexes in Python. Most of the regex functions in Python are in the 're' module. Passing a string value representing your regular expression to re.compile() returns a regex pattern object (or simply, a regex object). There are many methods associated with a regex object. For example, a regex object’s search() method searches the string it is passed for any matches to the regex. The search() method will return 'None' if the regex pattern is not found in the string. However, if the pattern is found, the search() method returns a 'Match' object. This 'Match' object, just like other objects in Python, has many methods associated with it. For example, a 'Match' object has a group() method that will return the actual matched text from the searched string.

In [2]:
import re
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # creating a regex object that matches the phone number pattern
regex_object = phoneNumRegex.search('Call my cell at 415-755-1011 tomorrow')
print(regex_object)
print('Phone number found: ' + regex_object.group())

<_sre.SRE_Match object; span=(16, 28), match='415-755-1011'>
Phone number found: 415-755-1011
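
Because search() returns 'None' when nothing matches, calling group() on the result without checking it first would raise an AttributeError. The following is a minimal sketch of one way to guard against that. The helper name find_phone_number is hypothetical and is used here only for illustration.

```python
import re

phone_num_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def find_phone_number(text):
    """Return the first phone number found in text, or None if there is none.

    Hypothetical helper, written only to illustrate the None check.
    """
    mo = phone_num_regex.search(text)
    if mo is None:  # search() found nothing, so there is no Match object
        return None
    return mo.group()

print(find_phone_number('Call my cell at 415-755-1011 tomorrow'))  # 415-755-1011
print(find_phone_number('No number in this sentence'))  # None
```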


Remember that escape characters in Python use the backslash sign '\'. The string value '\n' represents a single newline character, not a backslash followed by a lowercase n. You need to enter '\\' to represent a single backslash. So '\\n' is the string that represents a backslash followed by a lowercase n. However, by putting an 'r' before the first quote of the string value, you can mark the string as a raw string, which does not escape characters. Since regular expressions frequently use backslashes, it is convenient to pass raw strings to the re.compile() function instead of typing extra backslashes.
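
The difference between ordinary and raw strings can be checked directly by comparing string lengths. The short sketch below uses only built-in string behavior; the variable names are chosen for illustration.

```python
# Compare ordinary and raw string literals by their lengths.
newline_len = len('\n')    # one character: a newline
escaped_len = len('\\n')   # two characters: a backslash and an n
raw_len = len(r'\n')       # two characters: the raw string keeps the backslash

print(newline_len, escaped_len, raw_len)  # 1 2 2

# The raw spelling and the doubled-backslash spelling are the same string.
same_pattern = r'\d\d\d-\d\d\d\d' == '\\d\\d\\d-\\d\\d\\d\\d'
print(same_pattern)  # True
```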

The search() method, as demonstrated above, returns only the first match it finds. If you would like to find all phone numbers in a given string and iteratively display them, you will need to invoke the findall() method instead. When the regex contains no groups, findall() returns a list of the matched strings.

In [3]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # creating a regex object that matches the phone number pattern
regex_object = phoneNumRegex.findall("Call me at 415-755-1011 tmr or 310-506-7320, and don't use 310-472-2342")
print(regex_object) 
print(type(regex_object)) # this is a list

for phonenum in regex_object: # iterate over the matched strings directly
    print('Phone number found: ' + phonenum)

['415-755-1011', '310-506-7320', '310-472-2342']
<class 'list'>
Phone number found: 415-755-1011
Phone number found: 310-506-7320
Phone number found: 310-472-2342


Now that we have learned the basics of creating regex objects and finding matches with Python, we can explore some more powerful pattern-matching capabilities. Suppose we want to separate the area code from the rest of the phone number. Adding parentheses will create groups in the regex: (\d\d\d)-(\d\d\d-\d\d\d\d). Then you can use the group() method of the 'Match' object to grab the matching text from just one group. The first set of parentheses in a regex string will be group 1, the second set will be group 2, and so on. By passing the integer 1 or 2 to the group() method, you can grab different parts of the matched text. Passing 0 or nothing to the group() method will return the entire matched text. If you would like to retrieve all the groups at once, use the groups() method (note the plural form), which returns a tuple of the matched group strings.

In [4]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print('area code: '+ mo.group(1))
print('body of phone number: ' + mo.group(2))
print('full number: ' + mo.group(0))
print('full number (another way): ' + mo.group())
print(mo.groups()) # this is a tuple

area code: 415
body of phone number: 555-4242
full number: 415-555-4242
full number (another way): 415-555-4242
('415', '555-4242')
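
Since groups() returns a tuple, its values can be unpacked into separate variables in one assignment. A small sketch, reusing the regex from the cell above; the variable names area_code and main_number are chosen for illustration.

```python
import re

phone_num_regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phone_num_regex.search('My number is 415-555-4242.')

# groups() returns a tuple with one entry per group, so it can be unpacked.
area_code, main_number = mo.groups()
print(area_code)    # 415
print(main_number)  # 555-4242
```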


Parentheses always have a special meaning in regular expressions, but what do you do if you need to match a parenthesis in your text? For instance, maybe the phone numbers you are trying to match have the area code set in parentheses. In this case, you need to escape the left and right parentheses characters with a backslash preceding the parenthesis.

In [5]:
phoneNumRegex = re.compile(r'(\(\d\d\d\))(\d\d\d-\d\d\d\d)')
mo2 = phoneNumRegex.search('My phone number is (415)555-4242.')
print(mo2.group(1))
print(mo2.group(2))

(415)
555-4242


We now learn how to use the pipe "|" to match one of many expressions. You can think of the pipe operator as an 'or' in logic. For example, the regular expression r'Batman|Tina Fey' will match either 'Batman' or 'Tina Fey'. When both Batman and Tina Fey occur in the searched string, search() returns the first occurrence of matching text as the 'Match' object. Below is an example:

In [6]:
heroRegex = re.compile(r'Batman|Tina Fey')
mo1 = heroRegex.search('Batman and Tina Fey.')
print(mo1.group())
mo2 = heroRegex.search('Tina Fey and Batman.')
print(mo2.group())
mo3 = heroRegex.findall('Batman and Tina Fey.')
print(mo3)

Batman
Tina Fey
['Batman', 'Tina Fey']


Since the pipe can be roughly considered a logical 'or', we can extend its use beyond what is shown in the previous example. For example, we can also use the pipe to match one of several patterns as part of a regex. Suppose we want to match any of the strings 'Batman', 'Batmobile', 'Batcopter', and 'Batbat'. Since all these strings start with the prefix 'Bat', it would be nice if we could specify that prefix only once. This can be done with parentheses. By using the pipe character and grouping parentheses, we can specify several alternative patterns we would like our regex to match.

If we need to match an actual pipe character, we can escape it with a backslash.

Below is an example of using the pipe and parentheses combination:

In [7]:
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo3 = batRegex.search('Batmobile lost a wheel')
print(mo3.group())
print(mo3.group(1)) # mobile

Batmobile
mobile


Sometimes there is a pattern that we want to match only optionally. That is, the regex should find a match whether or not that bit of text is there. For this type of problem, the question mark "?" character flags the group that precedes it as an optional part of the pattern. Below is an example:

In [8]:
batRegex = re.compile(r'Bat(wo)?man')
mo1 = batRegex.search('The Adventures of Batman')
print(mo1.group()) # the '(wo)?' part of the regular expression means that the pattern wo is an optional group
mo2 = batRegex.search('The Adventures of Batwoman')
print(mo2.group())
print(mo2.group(1)) # wo

Batman
Batwoman
wo
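
The same idea applies to the phone number pattern from earlier, where the area code is optional. The sketch below is one possible way to write it: the area code and its hyphen are placed in a single group that is marked optional with the question mark.

```python
import re

# The area code and its trailing hyphen form one optional group.
phone_regex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')

mo1 = phone_regex.search('My number is 415-555-4242')
print(mo1.group())   # 415-555-4242

mo2 = phone_regex.search('My number is 555-4242')
print(mo2.group())   # 555-4242
print(mo2.group(1))  # None, because the optional group matched nothing
```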


Here come more tricks. The star (or asterisk) sign in a regex means "match zero or more": the group that precedes the star can occur any number of times in the text. It can be completely absent or repeated over and over again.

Again, if we need to match an actual star character, we can prefix the star in the regular expression with a backslash as usual.

In [9]:
batRegex = re.compile(r'Bat(wo)*man')
mo1 = batRegex.search('The Adventures of Batman')
print(mo1.group())
mo2 = batRegex.search('The Adventures of Batwoman')
print(mo2.group())
mo3 = batRegex.search('The Adventures of Batwowowowoman')
print(mo3.group())
print(mo3.group(1)) # wo

Batman
Batwoman
Batwowowowoman
wo


While the star/asterisk sign * means "match zero or more", the plus sign + means "match one or more". Unlike the star, which does not require its group to appear in the matched string, the group preceding a plus must appear at least once. It is not optional. Below is an example. The regex Bat(wo)+man will not match the string 'The Adventures of Batman' because at least one 'wo' is required by the plus sign.

If we need to match an actual plus sign character, again, we can prefix the plus sign with a backslash to escape it.

In [10]:
batRegex = re.compile(r'Bat(wo)+man')
mo1 = batRegex.search('The Adventures of Batwoman')
print(mo1.group())
mo2 = batRegex.search('The Adventures of Batwowowowoman')
print(mo2.group())
mo3 = batRegex.search('The Adventures of Batman')
print(mo3 is None) # True, since at least one 'wo' is required

Batwoman
Batwowowowoman
True
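
The text above mentions several times that a backslash lets us match an actual pipe, star, or plus character. The sketch below shows this for all three at once. It also uses re.escape(), a function of the 're' module that is not covered in this chapter, which adds the backslashes automatically; the sample strings are made up for illustration.

```python
import re

# A backslash turns |, * and + into ordinary characters to be matched literally.
literal_regex = re.compile(r'a\|b\*c\+')
mo = literal_regex.search('The token a|b*c+ appears here')
print(mo.group())  # a|b*c+

# re.escape() returns the text with special characters backslash-escaped.
escaped_pattern = re.escape('1+1')
mo2 = re.search(escaped_pattern, 'We know 1+1=2')
print(mo2.group())  # 1+1
```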


Let's move on to another trick. If we have a group that we want to repeat a specific number of times, we can follow the group in our regex with a number in curly brackets. For example, the regex (Ha){3} will match the string 'HaHaHa', but it will not match 'HaHa', since the latter has only two repeats of the (Ha) group.

When it comes to curly brackets, there is another trick: instead of one number, we can specify a range by writing a minimum, a comma, and a maximum between the curly brackets. For example, the regex (Ha){3,5} will match 'HaHaHa', 'HaHaHaHa', and 'HaHaHaHaHa'. We can also leave out the first or second number in the curly brackets to leave the minimum or maximum unbounded. For example, (Ha){3,} will match three or more instances of the (Ha) group, while (Ha){,5} will match zero to five instances.

In general, curly brackets can help make our regular expressions shorter. For example, these two regular expressions match identical patterns: (Ha){3} and (Ha)(Ha)(Ha). As another example, these two also match identical patterns: (Ha){3,5} and ((Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha)(Ha)).

In [11]:
haRegex = re.compile(r'(Ha){3}')
mo1 = haRegex.search('HaHaHaHaHaHa')
print(mo1.group()) # HaHaHa
mo2 = haRegex.search('Ha')
print(mo2) # None

HaHaHa
None
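
The range forms of the curly brackets described above can be tried out in the same way. In this sketch the variable names are chosen for illustration. Note that when several repetition counts are allowed, the match takes as many repetitions as it can by default.

```python
import re

text = 'HaHaHaHaHaHa'  # six repetitions of 'Ha'

three_to_five = re.compile(r'(Ha){3,5}').search(text).group()
three_or_more = re.compile(r'(Ha){3,}').search(text).group()
up_to_five = re.compile(r'(Ha){,5}').search('HaHa').group()
too_few = re.compile(r'(Ha){3,5}').search('HaHa')

print(three_to_five)  # HaHaHaHaHa (stops at the maximum of five)
print(three_or_more)  # HaHaHaHaHaHa (no upper bound)
print(up_to_five)     # HaHa (zero to five repetitions are allowed)
print(too_few)        # None (fewer than three repetitions)
```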


We have seen different ways of expressing digits and characters. Now we expand our knowledge by introducing character classes, which let us generalize from finding phone numbers to all kinds of character-matching problems. In the earlier phone number regex example, we learned that '\d' stands for any numeric digit. That is, '\d' is shorthand for the regular expression (0|1|2|3|4|5|6|7|8|9). There are in fact many such shorthand character classes. Below is a summary of the ones we use most often:

  1. '\d':    Any numeric digit from 0 to 9.
  2. '\D':    Any character that is not a numeric digit from 0 to 9.
  3. '\w':    Any letter, numeric digit, or the underscore character (think of this as matching “word” characters).
  4. '\W':    Any character that is not a letter, numeric digit, or the underscore character.
  5. '\s':    Any space, tab, or newline character (think of this as matching “space” characters).
  6. '\S':    Any character that is not a space, tab, or newline.

Character classes are nice for shortening regular expressions. The character class [0-5] will match only the numbers 0 to 5; this is much shorter than typing (0|1|2|3|4|5).

In [12]:
xmasRegex = re.compile(r'\d+\s\w+')
xmasRegex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, 5 rings, 4 birds, 3 hens')

['12 drummers',
 '11 pipers',
 '10 lords',
 '9 ladies',
 '8 maids',
 '7 swans',
 '6 geese',
 '5 rings',
 '4 birds',
 '3 hens']
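
The uppercase shorthand classes in the list above ('\D', '\W', '\S') match the opposite of their lowercase counterparts. A short sketch with made-up sample strings:

```python
import re

# \D+ matches runs of non-digit characters: here, the two hyphens.
non_digits = re.findall(r'\D+', '415-555-4242')
print(non_digits)  # ['-', '-']

# \W matches anything that is not a letter, digit, or underscore.
non_word = re.findall(r'\W', 'hi_there, you!')
print(non_word)    # [',', ' ', '!']

# \S+ matches runs of characters that are not spaces, tabs, or newlines.
non_space = re.findall(r'\S+', 'a b\tc\nd')
print(non_space)   # ['a', 'b', 'c', 'd']
```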

You can also create your own character classes. There are times when you want to match a set of characters but the shorthand character classes ('\d', '\w', '\s', and so on) are too broad. In that case, you can define your own character class using square brackets. For example, the character class [aeiouAEIOU] will match any vowel, both lowercase and uppercase.

In [13]:
vowelRegex = re.compile(r'[aeiouAEIOU]')
vowelRegex.findall('RoboCop eats pumpkins')

['o', 'o', 'o', 'e', 'a', 'u', 'i']

You can also include ranges of letters or numbers by using a hyphen. For example, the character class [a-zA-Z0-9] will match all lowercase letters, uppercase letters, and numbers. 
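
As a quick illustration of ranges inside square brackets, the sketch below applies [0-5] and [a-zA-Z0-9] to two made-up strings. Note that the underscore is not part of [a-zA-Z0-9], so 'user_42' is split into two pieces.

```python
import re

# [0-5] matches only the digits 0 through 5.
low_digits = re.findall(r'[0-5]', '0123456789')
print(low_digits)  # ['0', '1', '2', '3', '4', '5']

# [a-zA-Z0-9]+ matches runs of letters and digits.
alnum_regex = re.compile(r'[a-zA-Z0-9]+')
tokens = alnum_regex.findall('user_42 logged in at 10:30!')
print(tokens)  # ['user', '42', 'logged', 'in', 'at', '10', '30']
```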

Now let's introduce another important notation, called the caret sign "^". By placing a caret character "^" just after the character class’s opening bracket, you can make a negative character class. A negative character class will match all the characters that are not in the character class. For example:

In [14]:
poemRegex = re.compile(r'[^aeiouAEIOU]')
poemRegex.findall('I wandered lonely as a cloud.') # notice that the blank space ' ' also exists in the list

[' ',
 'w',
 'n',
 'd',
 'r',
 'd',
 ' ',
 'l',
 'n',
 'l',
 'y',
 ' ',
 's',
 ' ',
 ' ',
 'c',
 'l',
 'd',
 '.']

Speaking of the caret sign '^', you can also use the caret symbol at the start of a regex to indicate that a match must occur at the beginning of the searched text. Likewise, you can put a dollar sign at the end of the regex to indicate the string must end with this regex pattern. And you can use the '^' and '$' together to indicate that the entire string must match the regex—that is, it’s not enough for a match to be made on some subset of the string.

For example, the r'^Hello' regular expression string matches strings that begin with 'Hello'. As another example, the r'\d$' regular expression string matches strings that end with a numeric character from 0 to 9. Similarly, the r'^\d+$' regular expression string matches strings that consist entirely of one or more numeric characters, from beginning to end.

In [15]:
beginsWithHello = re.compile(r'^Hello')
print(beginsWithHello.findall('Hello world!'))
print(beginsWithHello.findall('He said Hello'))

['Hello']
[]


In [16]:
endsWithNumber = re.compile(r'\d$')
print(endsWithNumber.findall('Your number is 42'))
print(endsWithNumber.findall('Your number is forty two.'))

['2']
[]


In [17]:
wholeStringIsNum = re.compile(r'^\d+$')
print(wholeStringIsNum.findall('1234567890'))
print(wholeStringIsNum.findall('12345xyz67890'))
print(wholeStringIsNum.findall('12 34567890'))

['1234567890']
[]
[]
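
With the caret and dollar sign, the 17-line isPhoneNumber() function from the start of this chapter can be approximated in a few lines. This is a sketch rather than an exact replacement. The name is_phone_number is hypothetical, and the function has only been checked on the sample strings shown.

```python
import re

# ^ and $ force the whole string to match the pattern, not just part of it.
phone_regex = re.compile(r'^\d{3}-\d{3}-\d{4}$')

def is_phone_number(text):
    """Hypothetical regex-based counterpart of isPhoneNumber()."""
    return phone_regex.search(text) is not None

print(is_phone_number('415-555-4242'))           # True
print(is_phone_number('Moshi moshi'))            # False
print(is_phone_number('Call 415-555-4242 now'))  # False
```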


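Moving on to a new topic, we now need to study wildcards. In SAS, the asterisk/star sign "*" is used to denote a wildcard. In contrast, regular expressions in Python use the dot/period sign "." as the wildcard: it matches any single character except a newline. Because the dot stands for exactly one character, the word 'flat' in the example below yields only 'lat':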

In [18]:
atRegex = re.compile(r'.at')
atRegex.findall('The cat in the hat sat on the flat mat.')

['cat', 'hat', 'sat', 'lat', 'mat']

Sometimes you will want to match everything and anything. For example, say you want to match the string 'First Name:', followed by any and all text, followed by 'Last Name:', and then followed by anything again. You can use the dot-star combination (.*) to stand in for that "anything". Remember that the dot character means "any single character except the newline" and the star character means "zero or more of the preceding character". As a result, the dot-star will match everything up to, but not including, a newline. By passing re.DOTALL as the second argument to re.compile(), you can make the dot character match all characters, including the newline character.

Below are two examples of using the dot-star wild card:

In [19]:
nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
mo = nameRegex.search('First Name: Al Last Name: Sweigart')
print('Firstname: ', mo.group(1))
print('Lastname: ', mo.group(2))
print('\n')

noNewlineRegex1 = re.compile('.*')
print(noNewlineRegex1.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group())
print('\n')
newlineRegex2 = re.compile('.*', re.DOTALL)
print(newlineRegex2.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group())

Firstname:  Al
Lastname:  Sweigart


Serve the public trust.


Serve the public trust.
Protect the innocent.
Uphold the law.


Normally, regular expressions match text with the exact casing you specify. But sometimes you care only about matching the letters without worrying whether they’re uppercase or lowercase. To make your regex case-insensitive, you can pass re.IGNORECASE or re.I as a second argument to re.compile(). Below is an example:


In [20]:
robocop = re.compile(r'robocop', re.I)
print(robocop.search('RoboCop is part man, part machine, all cop.').group())
print(robocop.search('ROBOCOP protects the innocent.').group())
print(robocop.search('Al, why does your programming book talk about robocop so much?').group())

RoboCop
ROBOCOP
robocop


Regular expressions can not only find text patterns but can also substitute new text in place of those patterns. The sub() method of a regex object takes two arguments. The first argument is the string that replaces each match. The second is the string to search, in which the matches are replaced. The sub() method returns a new string with the substitutions applied.

In [21]:
namesRegex = re.compile(r'Agent \w+')
namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')

'CENSORED gave the secret documents to CENSORED.'

Sometimes you may need to use the matched text itself as part of the substitution. In the first argument to sub(), you can type \1, \2, \3, and so on, to mean "Enter the text of group 1, 2, 3, and so on, in the substitution". For example, say you want to censor the names of the secret agents by showing just the first letters of their names. To do this, you could use the regex Agent (\w)\w* and pass r'\1****' as the first argument to sub(). The "\1" in that string will be replaced by whatever text was matched by group 1—that is, the (\w) group of the regular expression.

In [22]:
agentNamesRegex = re.compile(r'Agent (\w)\w*')
agentNamesRegex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')

'A**** told C**** that E**** knew B**** was a double agent.'
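
The same technique works with more than one group. The sketch below, with a made-up message, uses \1, \2, and \3 to rewrite phone numbers from the 415-555-1011 style into the (415) 555-1011 style. The parentheses in the replacement string are ordinary characters, not groups.

```python
import re

# Three groups: area code, first three digits, last four digits.
phone_regex = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
message = 'Call 415-555-1011 or 415-555-9999.'

# \1, \2 and \3 insert the text matched by groups 1, 2 and 3.
reformatted = phone_regex.sub(r'(\1) \2-\3', message)
print(reformatted)  # Call (415) 555-1011 or (415) 555-9999.
```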

There are many other functionalities within the 're' module. Being able to utilize the full capacity of the 're' module does take some time. Nevertheless, the reward is great whether you are a programmer for natural language processing or an experienced data scientist who needs to spend a lot of time taking care of strings. 