# Advanced Regular Expressions


Lesson Goals

    Gain a deeper understanding of regular expressions.
    Use character and set matching to extract characters from text.
    Use meta-characters and quantifiers to write more robust regular expressions.
    Use character classes to make your regular expressions more concise.
    Combine regular expression components to match more complex patterns.
    Learn how to extract useful information from text such as words, capitalized words, quotes, and formatted numbers.

Introduction

Regular expressions are very useful for converting unstructured data to a structured format that is easier to analyze. In the String Operations lesson, we took a very introductory look at regular expressions and how to use them to search, split, and substitute characters in a text string. In this lesson, we will continue our study of regular expressions, going into more detail and learning how to perform more complex operations with them.

For this lesson, we will be working with the re library, so let's go ahead and import it. 

In [1]:
import re

In this lesson, we will work exclusively with the findall function of the re module because it returns every substring of the text that matches the regular expression patterns we construct. Once you have a solid understanding of that, you can easily use those patterns for splitting or substituting strings as necessary.
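To make that last point concrete, here is a minimal sketch that reuses one pattern with findall, split, and sub. The date string and the replacement character are made up for illustration, and the variable names are our own.

```python
import re

# Hypothetical sample text; any string containing hyphens would do.
date_text = "2019-01-15"
hyphen_pattern = '-'

found = re.findall(hyphen_pattern, date_text)      # every match of the pattern
pieces = re.split(hyphen_pattern, date_text)       # the text between the matches
swapped = re.sub(hyphen_pattern, '/', date_text)   # each match replaced with '/'

print(found)    # ['-', '-']
print(pieces)   # ['2019', '01', '15']
print(swapped)  # 2019/01/15
```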


Components of Regular Expressions

Regular expressions are patterns that we can put together using a combination of characters, sets, meta-characters, quantifiers, character classes, and groups. These expressions can be used to find patterns in text strings that match the expressions. In the sections that follow, we will delve into each of these components, describe them, and look at some examples of how they can be used.


# Characters and Sets

A regular expression made up of literal characters matches exactly those characters. We simply put the characters in quotes, and the regular expression extracts every place in the text where they appear in sequence. The example below finds and returns every instance of the single character a in the text.

In [2]:
text = "That person wears marvelous trousers."

pattern = 'a'
re.findall(pattern, text)

['a', 'a', 'a']

If we would like to extract a specific sequence of characters, we can just include them all in the quotes in the exact sequence we would like to find and extract them.

In [3]:
text = "That person wears marvelous trousers."

pattern = 'er'
re.findall(pattern, text)

['er', 'er']

In the example above, the regular expression matched and returned the pattern 'er' from the words person and trousers in the text, as that is where the two letters appear together in that specific sequence.

If we wanted to extract all instances of either letter e or r, we could enclose them in square brackets, making them a set. 

In [4]:
text = "That person wears marvelous trousers."

pattern = '[er]'
re.findall(pattern, text)

['e', 'r', 'e', 'r', 'r', 'e', 'r', 'e', 'r']

Making them a set will return every instance of any character within the square brackets.

You can embed sets within ordinary character sequences to match, for example, a word that has more than one accepted spelling.

In [5]:
text = "Is it spelled gray or grey?"

pattern = 'gr[ae]y'
re.findall(pattern, text)

['gray', 'grey']

If you have a range of characters, either alpha or numeric, that you want included in a set, you don't need to explicitly type out each one. You can simply put the beginning and end of the range inside square brackets with a hyphen (-) in the middle and it will return any characters it finds that are in the specified range.

In [6]:
text = "This is an A and B conversation, so C your way out of it."

pattern = '[A-C]'
re.findall(pattern, text)


['A', 'B', 'C']

In [7]:
text = "I'm not going to the party because 1) Karen is going, 2) I don't like her, and 3) I already have a headache."

pattern = '[1-3]'
re.findall(pattern, text)

['1', '2', '3']

You can use these up to their full ranges, listed below.

    [a-z]: Any lowercase letter between a and z.
    [A-Z]: Any uppercase letter between A and Z.
    [0-9]: Any numeric character between 0 and 9.
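Several ranges can also sit side by side inside one set. The sketch below, which uses a made-up sentence, combines the uppercase and numeric ranges to pull out only capital letters and digits.

```python
import re

# Hypothetical sample sentence for illustration.
sample = "Flight AC123 departs at 9am."

# Two ranges in one set: any uppercase letter OR any digit.
upper_or_digit = '[A-Z0-9]'
matches = re.findall(upper_or_digit, sample)
print(matches)  # ['F', 'A', 'C', '1', '2', '3', '9']
```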



# Meta-characters

The square brackets we used to create sets in the previous section are an example of meta-characters: characters that mean something other than their literal value when used in a regular expression. These meta-characters allow us to concisely write and match more complex patterns. Below is a list of common meta-characters and what they can be used for.

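    []: Match a set of characters
    .: Match any character except newline (\n)
    ^: Match any character not listed when placed first inside a set; outside a set, match the beginning of the string (or of each line with the re.MULTILINE flag)
    $: Match the end of the string or just before a trailing newline (or the end of each line with the re.MULTILINE flag)
    |: Functions as an "OR" operator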

Let's see how these meta-characters can be incorporated into our regular expressions and the impact they have on the results returned.

We have already seen an example of how square brackets work, so we will start with the . meta-character. 

In [8]:
text = "My boss asked me to turn in my TPS reports. \n I told him they were done, but they are not."

pattern = '.'
print(re.findall(pattern, text))

['M', 'y', ' ', 'b', 'o', 's', 's', ' ', 'a', 's', 'k', 'e', 'd', ' ', 'm', 'e', ' ', 't', 'o', ' ', 't', 'u', 'r', 'n', ' ', 'i', 'n', ' ', 'm', 'y', ' ', 'T', 'P', 'S', ' ', 'r', 'e', 'p', 'o', 'r', 't', 's', '.', ' ', ' ', 'I', ' ', 't', 'o', 'l', 'd', ' ', 'h', 'i', 'm', ' ', 't', 'h', 'e', 'y', ' ', 'w', 'e', 'r', 'e', ' ', 'd', 'o', 'n', 'e', ',', ' ', 'b', 'u', 't', ' ', 't', 'h', 'e', 'y', ' ', 'a', 'r', 'e', ' ', 'n', 'o', 't', '.']


As you can see, this returns every character except for the newline (\n) in our text.

Next, let's look at how the ^ meta-character works. When placed at the beginning of a character set enclosed in square brackets, it negates the set, so the pattern returns all characters not in the set (below, all characters outside the range of lowercase a to lowercase m).

In [9]:
pattern = '[^a-m]'
print(re.findall(pattern, text))

['M', 'y', ' ', 'o', 's', 's', ' ', 's', ' ', ' ', 't', 'o', ' ', 't', 'u', 'r', 'n', ' ', 'n', ' ', 'y', ' ', 'T', 'P', 'S', ' ', 'r', 'p', 'o', 'r', 't', 's', '.', ' ', '\n', ' ', 'I', ' ', 't', 'o', ' ', ' ', 't', 'y', ' ', 'w', 'r', ' ', 'o', 'n', ',', ' ', 'u', 't', ' ', 't', 'y', ' ', 'r', ' ', 'n', 'o', 't', '.']


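It can also be used outside sets to match a pattern at the beginning of the text. By default, ^ anchors to the start of the whole string, not to the start of every line within it.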

In [10]:
pattern = '^My boss'
print(re.findall(pattern, text))

['My boss']


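Similarly, the $ meta-character can be used to match a pattern at the end of the text. Note that the . before the $ in the pattern below is itself a meta-character, so it would match any final character, not only a period.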

In [11]:
pattern = 'they are not.$'
print(re.findall(pattern, text))

['they are not.']
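By default, Python treats ^ and $ as anchors for the whole string rather than for each line. If per-line matching is what you need, the re.MULTILINE flag changes that. The following sketch uses a made-up two-line string to show the difference.

```python
import re

# Hypothetical two-line string for illustration.
two_lines = "first line\nsecond line"

# Without a flag, ^ only anchors at the very start of the string.
default_starts = re.findall('^[a-z]', two_lines)

# With re.MULTILINE, ^ also anchors right after each newline.
multiline_starts = re.findall('^[a-z]', two_lines, flags=re.MULTILINE)

print(default_starts)    # ['f']
print(multiline_starts)  # ['f', 's']
```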


You might recognize the | meta-character as the OR operator from when we studied conditional statements. It functions the same way in regular expressions, matching either the pattern on its left or the pattern on its right.

In [12]:
pattern = 'boss|TPS|reports'
print(re.findall(pattern, text))

['boss', 'TPS', 'reports']


# Quantifiers

The next set of meta-characters are known as quantifiers because they specify how many times the preceding element should be repeated.

    *: Matches previous character 0 or more times
    +: Matches previous character 1 or more times
    ?: Matches previous character 0 or 1 times (optional)
    {}: Matches previous character the number of times specified within:
        {n} : Exactly n times
        {n,} : At least n times
        {n,m} : Between n and m times (inclusive)

Let's take a look at how these behave when we write regular expressions that include them. In the example below, we use the * quantifier to match zero or more a's between the letters c and t. 

In [13]:
text = "The complicit cat interacted with the other cats exactly as we expected."

pattern = "ca*t"
print(re.findall(pattern, text))

['cat', 'ct', 'cat', 'ct', 'ct']


You will notice that the results include matches for cat where the a appears once as well as matches for ct where the a appears zero times. It does not match the cit in the word complicit because the character between the c and the t is an i, and the pattern only allows a's (zero or more of them) in that position.

The + quantifier works similarly, but it returns matches where the previous character appears 1 or more times. In the example below, you'll see that it does not match ct if we swap out the * for a +. 

In [14]:
text = "The complicit cat interacted with the other cats exactly as we expected."

pattern = "ca+t"
print(re.findall(pattern, text))

['cat', 'cat']


The ? quantifier can be used when a character is optional and only sometimes included. In the example below, we place the ? after the u in our pattern so that we can match both spellings of the word.

In [15]:
text = "Is the correct spelling color or colour?"

pattern = "colou?r"
print(re.findall(pattern, text))

['color', 'colour']


If things are a little more complex and we need repetition counts other than 0 or 1, we can use the curly brackets ({}) to create custom quantifiers that repeat a character exactly n times, at least n times, or between n and m times.

In the example below, we are using the curly brackets with a 2 inside to match instances in the text where an a is followed by exactly two w's. Because findall returns only the matched portion, the longer runs such as awww each contribute an aww as well.

In [16]:
text = "Let's see how we can match the following: aw, aww, awww, awwww, awwwww"

pattern = "aw{2}"
print(re.findall(pattern, text))

['aww', 'aww', 'aww', 'aww']


If we want to return instances where the w appears at least two times, we can place a comma after the 2.

In [17]:
text = "Let's see how we can match the following: aw, aww, awww, awwww, awwwww"

pattern = "aw{2,}"
print(re.findall(pattern, text))

['aww', 'awww', 'awwww', 'awwwww']


As you can see, this returns the full text strings where the w's appear 2, 3, 4, and 5 times after the a.

If we would like to cap the number of w's returned at 4, we can add the number 4 after the comma. It will still match the same four results as in the previous example, but it will only include up to four w's in the results even if the string had five. 

In [18]:
text = "Let's see how we can match the following: aw, aww, awww, awwww, awwwww"

pattern = "aw{2,4}"
print(re.findall(pattern, text))

['aww', 'awww', 'awwww', 'awwww']
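A quantifier applies to whatever element comes directly before it, and that element can be a set as well as a single character. As a small sketch on a made-up string, the pattern below asks for runs of two to three digits; note how the four-digit number is capped at three, just like the w's above.

```python
import re

# Hypothetical sample text for illustration.
orders = "Order 7, order 42, order 365, order 2024"

# The {2,3} quantifier repeats the whole set [0-9], not one fixed character.
two_or_three_digits = '[0-9]{2,3}'
digit_runs = re.findall(two_or_three_digits, orders)
print(digit_runs)  # ['42', '365', '202']
```

The lone 7 is skipped because it is shorter than two digits, and the trailing 4 of 2024 is left over once the first three digits have been consumed.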


# Character Classes

There is one additional meta-character that we did not cover in the section above, and that is the backslash (\). The backslash serves a couple of different purposes in regular expressions. It allows you to escape any meta-character (if you want to match literal *, +, or { characters, for example), and it is also used to designate character classes. Character classes are shorthand for longer patterns that are frequently used. Below is a list of the most commonly used ones and what they match.

    \w: Any alphanumeric character or underscore.
    \W: Any character that is not alphanumeric or an underscore.
    \d: Any numeric character.
    \D: Any non-numeric character.
    \s: Any whitespace characters.
    \S: Any non-whitespace characters.
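Before turning to the character classes themselves, the escaping role of the backslash can be sketched as follows. The sample string is made up, and the patterns are written as raw strings (r'...') so that Python passes the backslash through to the regular expression unchanged.

```python
import re

# Hypothetical sample text containing literal meta-characters.
math_text = "2+2=4 and 3*3=9"

# Escaped with a backslash, + and * are literal characters, not quantifiers.
plus_signs = re.findall(r'\+', math_text)
operators = re.findall(r'\+|\*', math_text)

print(plus_signs)  # ['+']
print(operators)   # ['+', '*']
```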

As mentioned previously, character classes serve as shorthand for longer patterns. For example, the \w character class returns matches for any alphanumeric character or underscore. To do this without the \w character class, you would have to put the following set together: [a-zA-Z0-9_] (this equivalence holds for ASCII text; in Python 3, \w also matches letters and digits from other alphabets by default). Let's take a look at how useful these character classes can be when incorporated into our regular expressions.

In [19]:
text = "Th1s is going to_be a weird sentence with @ bunch-of-$tuff in it <3."

# Raw string (r'...') so the backslash reaches the regex engine unchanged
pattern = r'\w'
print(re.findall(pattern, text))

['T', 'h', '1', 's', 'i', 's', 'g', 'o', 'i', 'n', 'g', 't', 'o', '_', 'b', 'e', 'a', 'w', 'e', 'i', 'r', 'd', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', 'w', 'i', 't', 'h', 'b', 'u', 'n', 'c', 'h', 'o', 'f', 't', 'u', 'f', 'f', 'i', 'n', 'i', 't', '3']


This returned all the alphanumeric characters (plus the underscore) in our text. If we wanted to return all the other characters, we could use the \W character class instead.

In [20]:
pattern = r'\W'
print(re.findall(pattern, text))

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '@', ' ', '-', '-', '$', ' ', ' ', ' ', '<', '.']


If we wanted only the numeric characters, we could use the \d character class as follows.

In [21]:
pattern = r'\d'
print(re.findall(pattern, text))

['1', '3']


What about if we wanted to match everything but numeric characters? We could use the \D character class to do that. 

In [22]:
pattern = r'\D'
print(re.findall(pattern, text))

['T', 'h', 's', ' ', 'i', 's', ' ', 'g', 'o', 'i', 'n', 'g', ' ', 't', 'o', '_', 'b', 'e', ' ', 'a', ' ', 'w', 'e', 'i', 'r', 'd', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', ' ', 'w', 'i', 't', 'h', ' ', '@', ' ', 'b', 'u', 'n', 'c', 'h', '-', 'o', 'f', '-', '$', 't', 'u', 'f', 'f', ' ', 'i', 'n', ' ', 'i', 't', ' ', '<', '.']


We could do the same thing for whitespace and non-whitespace characters using the \s and \S character classes respectively. Give those a try on your own and see what gets returned.
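If you would like something to compare your attempt against, here is one possible sketch. It uses a made-up string that includes a tab and a newline, since those count as whitespace too.

```python
import re

# Hypothetical sample text with a tab, a newline, and ordinary spaces.
spaced = "tabs\tand\nnewlines count too"

whitespace = re.findall(r'\s', spaced)   # each individual whitespace character
chunks = re.findall(r'\S+', spaced)      # runs of non-whitespace characters

print(whitespace)  # ['\t', '\n', ' ', ' ']
print(chunks)      # ['tabs', 'and', 'newlines', 'count', 'too']
```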

Writing More Complex Regular Expressions

So far in this lesson, we have looked at regular expression components and the results they retrieve in isolation. This is useful, but you will often need to combine various components into more complex regular expressions to extract the information you need. In this section, we will look at several examples that combine these components to retrieve more useful information than just characters.



# Extracting Words from a Text

Often the information we need to extract from text does not consist of individual characters but of words or phrases that match some criteria communicated in the form of a regular expression. For example, we could use the following pattern to extract a list of all the words (all sequences of alphanumeric characters) in our text.

In [23]:
text = "If you tell the truth, you don't have to remember anything."

pattern = r'\w+'
print(re.findall(pattern, text))

['If', 'you', 'tell', 'the', 'truth', 'you', 'don', 't', 'have', 'to', 'remember', 'anything']


In the example above, we used a character class (\w) followed by a + to return alphanumeric character sequences of length 1 or greater.
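One side effect visible in the output is that don't was split into don and t, because the apostrophe is not a word character. If contractions should stay whole, one option (a sketch, not the only approach) is to put \w inside a set together with the apostrophe.

```python
import re

quote = "If you tell the truth, you don't have to remember anything."

# A set can hold a character class plus extra literal characters,
# so [\w']+ matches runs of word characters and apostrophes.
words_with_apostrophes = re.findall(r"[\w']+", quote)
print(words_with_apostrophes)
# ['If', 'you', 'tell', 'the', 'truth', 'you', "don't", 'have', 'to', 'remember', 'anything']
```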

When working with text, there is often a need to remove words that are not meaningful. One way to define that would be based on word length. For example, most words of three letters or fewer are very common (a, be, and, the, you, etc.) and do not add much value. We can remove such words by replacing the + in the previous example with {4,}.

In [24]:
pattern = r'\w{4,}'
print(re.findall(pattern, text))

['tell', 'truth', 'have', 'remember', 'anything']


This removes some of the noise and allows us to focus on words that have the potential to be more useful to any analysis we are looking to perform.


# Extracting Capitalized Words from Text

It is also often useful to extract proper nouns from text. This could tell you what people, places, and things the text is about. Proper nouns start with capital letters, so we could use that as part of our regular expression to extract them.

In [25]:
text = "TerraPower, a nuclear-energy company founded by Bill Gates, is unlikely to follow through on building a demonstration reactor in China, due largely to the Trump administration’s crackdown on the country."

pattern = '[A-Z][a-z]+'
print(re.findall(pattern, text))

['Terra', 'Power', 'Bill', 'Gates', 'China', 'Trump']


Grouping

In the example above, the pattern we are defining is looking for an uppercase letter followed by a series of one or more lowercase letters. This gives us the individual capitalized words, which is good, but ideally, TerraPower and Bill Gates should be grouped together. It turns out that we can improve upon this by using groups. Grouping is done by enclosing regular expression components that belong together within parentheses, and it allows us to create more complex regular expressions like the kind we would need to group consecutively capitalized words together. 

In [26]:
pattern = '([A-Z][a-z]+ ?[A-Z][a-z]+)|([A-Z][a-z]+)'
print(re.findall(pattern, text))

[('TerraPower', ''), ('Bill Gates', ''), ('', 'China'), ('', 'Trump')]


In this example, we have two groups in our regular expression. The first matches a capitalized first letter followed by a series of lowercase characters, an optional space, and then another capitalized first letter and series of lowercase characters (essentially two capitalized words optionally separated by a space). The second expression simply matches a single capitalized word, and the OR operator in between them specifies that one or the other should be returned if matched. You can see that the results are formatted as a list of tuples where the result is positioned as the first element if it matches the first grouped expression and the second element if it matches the second grouped expression.

If you didn't care for the nested structure and just wanted a single neat list with the results, you could use a list comprehension to obtain that as follows.

In [27]:
results = [i for j in re.findall(pattern, text) for i in j if i != '']
results

['TerraPower', 'Bill Gates', 'China', 'Trump']
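An alternative that avoids the tuples altogether is a non-capturing group, written (?:...). It groups components in the same way, but because it captures nothing, findall returns the whole match as a plain string. The sketch below is one possible formulation rather than the lesson's own pattern, and the sentence is a shortened, made-up variant of the text above.

```python
import re

# Hypothetical shortened sentence for illustration.
headline = "TerraPower, a nuclear-energy company founded by Bill Gates, is unlikely to build a reactor in China."

# (?:...) groups without capturing, so findall returns whole matches.
# The trailing * lets the "optional space + capitalized word" part repeat
# zero or more times, covering one-word and multi-word names alike.
proper_noun = r'[A-Z][a-z]+(?: ?[A-Z][a-z]+)*'
names = re.findall(proper_noun, headline)
print(names)  # ['TerraPower', 'Bill Gates', 'China']
```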

You can see that this gets us part of the way toward what is known as Named Entity Recognition (NER), which involves the extraction of proper nouns (names of people, places, things, etc.) from a body of text. NER is much more complex, however, and often involves both tagging words with their parts of speech and some machine learning. This is beyond the scope of this lesson, but we will provide some coverage in a future lesson.


# Extracting Quotes from Text

Another useful application of regular expressions is extracting quotes from text. It is relatively straightforward to do this. We simply need to match all characters within a set of quotes in the text. Recall that we can use a dot (.) to retrieve any character except a newline (\n).

In [28]:
text = """
For eight young men the AP tracked down in Seattle, tech obsession has become something much darker, getting in the way of their normal lives.

"We’re talking flunk-your-classes, can’t-find-a-job, live-in-a-dark-hole kinds of problems, with depression, anxiety and sometimes suicidal thoughts part of the mix," the AP's Martha Irvine reports.
"""

pattern = '".*"'
re.findall(pattern, text)

['"We’re talking flunk-your-classes, can’t-find-a-job, live-in-a-dark-hole kinds of problems, with depression, anxiety and sometimes suicidal thoughts part of the mix,"']

The regex pattern above looks for double quotes in the text and then returns all the characters (.) in between them, where the series of characters can be of length 0 or greater (*). The result is a list containing the quoted portion of our text. One important thing to note here is that regular expression quantifiers are greedy. This means they match as much text as they possibly can while still allowing the entire pattern to match successfully.
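Greediness matters as soon as a single line contains more than one quote. The sketch below uses a made-up line of dialogue to compare the greedy pattern with a non-greedy (lazy) variant, which is written by adding a ? after the quantifier.

```python
import re

# Hypothetical line with two separate quotes.
dialogue = '"Hello," she said, "goodbye."'

# Greedy: .* runs to the LAST double quote on the line.
greedy = re.findall('".*"', dialogue)

# Lazy: .*? stops at the NEXT double quote it can find.
lazy = re.findall('".*?"', dialogue)

print(greedy)  # ['"Hello," she said, "goodbye."']
print(lazy)    # ['"Hello,"', '"goodbye."']
```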


# Extracting Formatted Numbers

The last example we will look at in this lesson shows how regular expressions can be used for extracting formatted numbers from a body of text. Formatted numbers can include things such as phone numbers, social security numbers, or certain types of account numbers.

For example, let's say you had the following text containing phone numbers for several airlines and wanted to extract only the phone numbers.

In [29]:
text = """
Aeromexico 800-237-6639
Air Canada 888-247-2262
Air Canada Rouge 888-247-2262
Air Creebec 800-567-6567
Air Inuit 800-361-2965
Air North 800-661-0407
Air Tindi 888-545-6794
Air Transat 866-847-1112
Alaska Airlines 800-426-0333, 866-516-1685
Allegiant Air 702-505-8888
American Airlines 800-433-7300
Bearskin Airlines 807-577-1141
Buffalo Airways 867-874-3333
Calm Air 800-839-2256
Cape Air 800-227-3247
Delta Air Lines 800-455-2720
First Air 800-267-1247
Flair Airlines 204-888-2665
Frontier Airlines 801-401-9000
Harbor Air 800-665-0212
Hawaiian Airlines 877-426-4537
Horizon Air 800-547-9308
InterJet 866-285-8307
Island Air 800-388-1105
JetBlue 800-538-2583
Porter Airlines 888-619-8622
Silver Airways 801-401-9100
Southwest Airlines 800-435-9792
Spirit Airlines 801-401-2222
Sun Country Airlines 800-359-6786
Sunwing 877-SUN-WING
Thunder Airlines 800-803-9943
United Airlines 800-864-8331
Virgin America 877-359-8474
VivaAerobus 888-935-9848 
Volaris 855-865-2747
WestJet Airlines 888-937-8538
"""

You could do this easily by writing a regular expression consisting of \d+ character classes separated by hyphens.

In [30]:
pattern = r'\d+-\d+-\d+'
re.findall(pattern, text)

['800-237-6639',
 '888-247-2262',
 '888-247-2262',
 '800-567-6567',
 '800-361-2965',
 '800-661-0407',
 '888-545-6794',
 '866-847-1112',
 '800-426-0333',
 '866-516-1685',
 '702-505-8888',
 '800-433-7300',
 '807-577-1141',
 '867-874-3333',
 '800-839-2256',
 '800-227-3247',
 '800-455-2720',
 '800-267-1247',
 '204-888-2665',
 '801-401-9000',
 '800-665-0212',
 '877-426-4537',
 '800-547-9308',
 '866-285-8307',
 '800-388-1105',
 '800-538-2583',
 '888-619-8622',
 '801-401-9100',
 '800-435-9792',
 '801-401-2222',
 '800-359-6786',
 '800-803-9943',
 '800-864-8331',
 '877-359-8474',
 '888-935-9848',
 '855-865-2747',
 '888-937-8538']
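The \d+ pattern works here because the only hyphenated digit groups in the text are phone numbers. In messier text it would also pick up dates and other hyphenated numbers, so a stricter variant with exact digit counts may be safer. The sketch below compares the two on a made-up sentence, assuming phone numbers in the 3-3-4 digit format used above.

```python
import re

# Hypothetical sentence mixing a phone number with other hyphenated digits.
mixed = "Call 800-237-6639 before 2019-01-15, or try 1-2-3."

# Loose: any three hyphen-separated runs of digits.
loose = re.findall(r'\d+-\d+-\d+', mixed)

# Strict: exactly 3, 3, and 4 digits per group.
strict = re.findall(r'\d{3}-\d{3}-\d{4}', mixed)

print(loose)   # ['800-237-6639', '2019-01-15', '1-2-3']
print(strict)  # ['800-237-6639']
```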

These examples highlight just some of the scenarios where regular expressions really come in handy.