[Table of Contents](../../index.ipynb)

# FRC Analytics with Python - Session 13
# Working with Text, Part II
**Last Updated: 29 April 2021**

In the preceding session we studied string literals, escape sequences, string formatting, encoding, string slicing, and string methods. While a programmer can get a lot of work done with these techniques, they only go so far. If you wanted to search a block of text for chemical formulas or email addresses, you probably would need to write custom functions, and these custom functions would be difficult to debug.

In this session we'll study a text processing tool called *regular expressions*. Regular expressions can complete sophisticated text processing jobs, like identifying email addresses or chemical formulas, in just a few lines of code. It takes some effort to learn regular expressions, but they are available in most programming languages, in command line tools, and in several relational databases. Once you learn them you can use them anywhere!

Finally, we'll cover some of the text processing tools that are available in the Pandas package. It's pretty common to have a text column in a dataframe and to need to conduct some sort of transformation or evaluation of that column.

## I. Introduction to Regular Expressions

### A. Our First Expression
First, let's create a long string on which we can try out some regular expressions (RE). 

In [None]:
shel_poem = """
I told my robot to do my biddin'
He yawned and said, "You must be kiddin'."
I told my robot to cook me a stew.
He said, "I got better things to do."
I told my robot to sweep my shack.
He said. "You want me to strain my back?"
I told my robot to answer the phone.
He said, "I must make some calls of my own."
I told my robot to brew me some tea.
He said, "Why don’t you make tea for me?"
I told my robot to boil me an egg.
He said, "First -– lemme hear you beg."
I told my robot, "There’s a song you can play me."
He said, "How much are you gonna pay me?"
So I sold that robot, 'cause I never knew
Exactly who belonged to who.
"""

To use regular expressions in Python we must first import the *re* module from the standard library. We then create a pattern that specifies what we want to do, and finally we pass that pattern to a regular expression function.

In [None]:
# First regular expression example
import re

pattern = r"robot"

mtch = re.search(pattern, shel_poem)
mtch

In the preceding example, we created the regular expression pattern "robot" using a raw string. (We'll see why raw strings are useful later.) We then passed the pattern and a poem by Shel Silverstein to the `search()` function from Python's `re` module. The search function finds the first span of text in the poem that matches the pattern, and returns the results as a `Match` object. The `Match` object tells us that the first occurrence of "robot" starts at position 11 and ends at position 16. Note that `Match` positions are zero-based, like all Python string indexes: the poem string begins with a newline character at position 0, so the "I" is at position 1. The end position (16) is exclusive, so `shel_poem[11:16]` is "robot".

`Match` objects have several different methods and attributes, which are described in the [documentation for the `re` module](https://docs.python.org/3/library/re.html#match-objects). We can extract the position of the match with the `.span()` method, and we can extract the text that matched the pattern using list-style indexing.

In [None]:
mtch[0]
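
As a small illustration of the `Match` methods mentioned above, here is a sketch that uses a short, made-up string instead of the full poem. The names `sample` and `m` are ours and do not appear elsewhere in the notebook.

```python
import re

sample = "I told my robot to do my biddin'"
m = re.search(r"robot", sample)

print(m.span())    # (start, end) positions; the end position is exclusive
print(m.start())   # index of the first matched character
print(m.end())     # index just past the last matched character
print(m[0])        # the matched text, same as m.group(0)
print(sample[m.start():m.end()])  # slicing with the span recovers the match
```

Because `sample` has no leading newline, the positions reported here are one less than the positions reported for the poem.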

You might be wondering why we would use list style indexing to extract the text that was matched ("robot"). The answer is that it's possible for a `Match` object to contain more than one search result, which will be explained later.

If no match is found, `re.search()` returns `None`.

In [None]:
type(re.search("Python", shel_poem))

### B. So What?
If you are thinking "So what? We could have just used the `.find()` method," you are correct. Using the `.find()` method in this case has advantages. It only requires one line of code, we don't have to import the `re` module, and we don't have to deal with a weird `Match` object. If all you want to do is find the position of a string within another string, `.find()` is a good way to go.

In [None]:
# Using the .find() method
shel_poem.find("robot")

The answer to "So what?" is that we have not even scratched the surface of what regular expressions can do. To get a better understanding of their potential, we'll try another function from the `re` module. The `re.findall()` function finds ALL matches within a string.

In [None]:
# .findall() example
pattern = r"told"
re.findall(pattern, shel_poem)

The `re.findall()` function returns the matches as a list. Now you're probably thinking "OK, that's a little bit more interesting, but I'm still not impressed." We can see that the word "told" occurs seven times within the poem, but `re.findall()` does not provide the position of the matches.

Let's make it more interesting.

In [None]:
# Find all words that follow the word "to"
re.findall(r"to\s(\w+)\W", shel_poem)

What just happened? If you've never seen a regular expression before, then the pattern looks like gobbledygook. What it's doing is finding every word in the poem that follows the word "to".

Think about how you could accomplish the same task with the string methods that we've learned so far. You could use the `.find()` method to find the position of each occurrence of the word "to", then scan the text starting three characters after the "t" in "to" to figure out where the next word ends, then finally you could use string slices to extract the word. You might choose to write a custom function for this task. But with a regular expression, we accomplished the task in one line of code. This example hints at the power of regular expressions.

### C. Pattern Matching Overview
At a high level, a regular expression function attempts to match a pattern to a string of text. The function returns information about the match if one is found, or reports if no match is found.

Suppose we run the following line of code.  
```re.search(r"to ", "I told my robot to do my biddin")```

1. The pattern consists of the three characters 't', 'o', and ' '. None of these characters have special meaning. Characters that are just letters, digits, or white space will match the same character in the string to be searched.
2. The `re.search()` function gets the first character from the string to be searched and compares it to the beginning of the pattern. The first character in the string is 'I' and the first character of the pattern is 't'. 'I' does not match 't'.
3. Next, the search function retrieves the second character from the string, which is a space (' ').  A space also does not match 't'.
4. Next, the search function retrieves the third character from the string, 't'. We now have a partial match that begins at the third character of the search string, so the search function attempts to match the rest of the pattern from there.
    * The search function extracts the next character from the string, 'o' and sees if it matches the next character in the pattern, 'o'. They match.
    * The search function extracts the next character from the string, 'l' and compares it to the third character in the pattern, which is a space. Since they don't match, the search function gives up on matching at this location.
5. Since the prior match attempt failed, the function goes back to the beginning of the pattern. It attempts to match the 4th character, 'o' to the beginning of the pattern, 't'. The function will proceed in this manner until it either finds a full match or reaches the end of the string.
6. We can see that the function will successfully match the pattern starting at the 17th character of the string (index 16 in Python's zero-based numbering), where the character sequence "to " occurs. At this point the search function stops and returns a match object.
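
The steps above can be sketched in plain Python. This is only an illustration for patterns made up entirely of literal characters; it is not how the `re` module is implemented, and the function name `naive_literal_search` is hypothetical.

```python
import re

def naive_literal_search(pattern, text):
    """Return the (start, end) span of the first literal match, or None."""
    for start in range(len(text) - len(pattern) + 1):
        # Compare the pattern to the text one character at a time.
        for offset, pattern_char in enumerate(pattern):
            if text[start + offset] != pattern_char:
                break  # mismatch: give up at this starting position
        else:
            # The inner loop finished without a mismatch: full match.
            return (start, start + len(pattern))
    return None

line = "I told my robot to do my biddin"
print(naive_literal_search("to ", line))
print(re.search(r"to ", line).span())
```

Both calls report the same span, which starts at index 16 (the 17th character).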

### D. Analyzing a More Complex Pattern
Now let's reconsider the statement `re.findall(r"to\s(\w+)\W", shel_poem)`. The pattern in this statement includes two regular characters and several sequences and symbols that have special meaning. Let's break it down.
* The first two pattern characters, "to" have no special meaning and will only match the character sequence "to" in the search string.
* The sequence "\s" is called a character class. This specific character class ("\s") will match any single white space character, such as a space, a tab, or a newline. This means the search string must contain the word "to" followed by a space or newline to get a match -- it won't match a word like "told".
* The next character in the pattern is an open parenthesis, which is paired with a closed parenthesis a few characters later. Ignore the parentheses for now -- we'll explain their purpose later.
* The sequence "\w" is another character class that matches any Unicode word character. "\w" will match the characters A through Z and a through z, as well as digits and the underscore. It will also match characters from other alphabets that appear in words in other languages, such as β, か, or ख.
* The "+" sign modifies the preceding character class. It specifies that "\w" should match one *or more* Unicode word characters.
* Finally, the character class "\W" (note the *capital* W) is the negation of "\w". It matches any character that is *not* a Unicode word character.

In summary, to get a match, the string must contain the character sequence "to", followed by a white space character, then one or more word characters, and ending with any character that is not a word character.
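
To see how each token narrows the match, the following sketch builds the pattern up one piece at a time on a single line of the poem. The variable name `line` is ours.

```python
import re

line = "I told my robot to do my biddin'"

print(re.findall(r"to", line))           # literal characters only: also matches inside "told"
print(re.findall(r"to\s", line))         # "to" must be followed by white space
print(re.findall(r"to\s\w+", line))      # ...and then by one or more word characters
print(re.findall(r"to\s(\w+)\W", line))  # the parentheses keep only the word after "to"
```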

The results generated by this pattern did not include the word "to", which occurs at the beginning of every match, or the non-letter character that occurs at the end of every match. Why is that? It has to do with the parentheses. Let's run the statement again, but without the parentheses.

In [None]:
re.findall(r"to\s\w+\W", shel_poem)

Now the `re.findall()` function returns the entire match. The parentheses that we just removed defined a capturing group. When parentheses are included in the pattern, the `re.findall()` function returns only the text that matches the portion of the pattern inside the parentheses. Patterns can define more than one group. For example:

In [None]:
re.findall(r"to\s(\w+)\W(\w*)\W", shel_poem)

By including a second capturing group, this pattern returns the next two words after the word "to". The groups for each match are assembled into a tuple. 

Note that the second capturing group replaced '+' with '\*'. The asterisk (\*) causes the preceding character or character class to match *zero* or more occurrences, as opposed to the plus sign's *one* or more occurrences. How would the results be different if we used '+' in the second capture group instead of '\*'? How many matches would be returned?

Do you see why we are using raw strings (strings prepended with "r") to define our patterns? In regular strings, the backslash "\\" normally indicates an escape sequence. If we didn't use raw strings, we would have to escape every backslash with another backslash. Our most recent pattern would have looked like this: `"to\\s(\\w+)\\W(\\w*)\\W"`. Regular expression patterns are already confusing. Let's not make them more so by doubling the backslashes -- you are encouraged to use raw strings to enhance the readability of regular expression patterns.
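
A quick check, using variable names of our own choosing, shows that the raw and doubled-backslash spellings produce exactly the same pattern string:

```python
import re

raw_pattern = r"to\s(\w+)\W"
escaped_pattern = "to\\s(\\w+)\\W"

print(raw_pattern == escaped_pattern)   # both spellings produce the same string
print(len(r"\n"), len("\n"))            # raw: backslash plus n; regular: one newline character
print(re.findall(raw_pattern, "to do, or not to do."))
```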

## II. Regular Expression Tokens
Each part of a regular expression pattern is called a token. Tokens are usually comprised of a single character, but in some cases they can be two or more characters long. There are several types of tokens. This notebook explains a few of the most commonly used tokens. For additional tokens, refer to the [documentation for the `re` module](https://docs.python.org/3/library/re.html) or to [this excellent and comprehensive regular expression tutorial](https://www.regular-expressions.info/tutorial.html).

### A. Literal Characters
The simplest token is a literal character. Every character that is not a special character is a literal character. There are twelve special characters: `\^$.|?*+()[{`. Therefore all letters, numbers, white space characters, and many punctuation marks are literal characters. A literal character token matches the same character in a string. The pattern from the first regular expression example, "robot" consisted entirely of literal characters.

### B. Character Classes
The second type of token which we will discuss is character classes.

#### 1. Predefined Classes
We already discussed "\w", which matches Unicode word characters, "\W", which matches any character that is not a word character, and "\s", which matches any white space character. Here are some additional character classes.
* "\S" matches any character that is *not* white space.
* "\d" matches any Unicode decimal digit.
* "\D" matches any character that is not a Unicode decimal digit.
* "\b" matches an empty string when it occurs at the beginning or end of a word.
* "\B" matches an empty string when it occurs anywhere other than at the beginning or end of a word.
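
Here is a short sketch of a few of these classes applied to a made-up sentence (the string `text` is our own example, not data from the notebook):

```python
import re

text = "Team 1318 scored 42 points!"

print(re.findall(r"\d+", text))      # runs of digits
print(re.findall(r"\D+", text))      # runs of characters that are not digits
print(re.findall(r"\S+", text))      # runs of characters that are not white space
print(re.findall(r"\bs\w*", text))   # words that begin with "s"
```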

"\b" and "\B" are a bit tricky to understand. We can rewrite our "to \_\_\_\_\_" example using "\b".

In [None]:
re.findall(r"to\s(\w+)\b", shel_poem)

"\b" effectively matches word boundaries. To understand "\B", take a look at this example from Python's Standard Library documentation.

In [None]:
re.findall(r"py\B\S*", "python py3 py2 py py. py!")

The sequence "py" is only matched if it is followed by characters that do not indicate the end of a word.

#### 2. User Defined Classes
It's very easy to define your own character classes within regular expression patterns. Square brackets (`[]`) are used to define custom character classes.

In [None]:
colors = """
dark salmon	#E9967A	(233,150,122)
 	salmon	#FA8072	(250,128,114)
 	light salmon	#FFA07A	(255,160,122)
 	orange red	#FF4500	(255,69,0)
 	dark orange	#FF8C00	(255,140,0)
 	orange	#FFA500	(255,165,0)
 	gold	#FFD700	(255,215,0)
 	dark golden rod	#B8860B	(184,134,11)
 	golden rod	#DAA520	(218,165,32)
 	pale golden rod	#EEE8AA	(238,232,170)
 	dark khaki	#BDB76B	(189,183,107)
 	khaki	#F0E68C	(240,230,140)
"""

# Extract all hex numbers from text
ptn = r"#([\da-fA-F]+)"  # raw string, so "\d" is not treated as a string escape
print(re.findall(ptn, colors))

The portion of the pattern contained within the square brackets will match any character that is a digit, one of the lowercase letters 'a' through 'f', or one of the uppercase letters 'A' through 'F'.
* Digits are matched due to the presence of the character class "\d".
* Upper and lower case letters are matched due to the ranges "a-f" and "A-F".

Individual characters (not ranges) can be included in character classes by including just the character within the square brackets. For example, the following pattern extracts all words containing only the letters I, R, and S from a text file of all legal Scrabble words. The Scrabble word file is a plain text file with all words in upper case and each word separated by a newline character. Note that in this example we are using "\b" to denote word boundaries.

In [None]:
with open("data/Collins Scrabble Words.txt", "rt") as sfile:
    scrabble_words = sfile.read()
    
print("First few lines of Scrabble dictionary file:")
print(scrabble_words[:29])

# Find all words containing only I, R, or S.
print("Words that contain only I, R, or S:")
print(re.findall(r"\b[IRS]+\b", scrabble_words))

If the first character within the square brackets is '^', then the character class consists of all characters that are *not* contained within the square brackets. For example, the following expression finds all legal Scrabble words that do not contain any vowels.

In [None]:
# Legal Scrabble words that do not contain vowels.
print(re.findall(r"\b[^AEIOUY\s]+\b", scrabble_words))

The hyphen (-) and closing bracket (]) become special characters inside a character class. The backslash (\\) and caret (^) remain special, but all other special characters are treated as literal characters. This means you could create a character class that searches for plus signs and asterisks: `[+*]`. To treat a hyphen, caret, closing bracket, or backslash as a literal character, precede it with a backslash: `[\-\^\]\\]`.
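
The following sketch tries these rules on a few short, made-up strings:

```python
import re

print(re.findall(r"[+*]", "3+4*5"))             # + and * are literal inside brackets
print(re.findall(r"[\-\^\]\\]", r"a-b^c]d\e"))  # escaped hyphen, caret, bracket, backslash
print(re.findall(r"[^0-9]", "a1b2"))            # a leading ^ negates the class
```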

### C. Dot and Backslash
The period (.), which is commonly called a *dot* when it's used within regular expression patterns, matches any character other than a newline ("\\n") character. Use it sparingly. Since it can match anything, it's easy to get inadvertent, unanticipated matches. You can often use a predefined or custom character class in place of the dot.

The backslash (\\) causes subsequent special characters to be treated as literal characters. If you wanted to match a literal period (.), you could include `\.` in your pattern.
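
A minimal sketch of the difference between the dot and an escaped dot, using a made-up string that contains a newline:

```python
import re

text = "abc a.c a\nc"

print(re.findall(r"a.c", text))    # the dot matches any character except a newline
print(re.findall(r"a\.c", text))   # the escaped dot matches only a literal period
```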

### D. Quantifiers
Quantifiers modify how the preceding token is matched. We've already seen two different quantifiers:
* The asterisk (\*) causes the preceding token to match *zero or more* occurrences.
* The plus sign (+) causes the preceding token to match *one or more* occurrences.

The remaining quantifiers are:
* The question mark (?) causes the preceding token to match *zero or one* occurrences.
* `{m}` matches exactly *m* copies of the preceding token, where *m* is an integer.
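* `{m,n}` matches from *m* to *n* repetitions, where both *m* and *n* are integers. Do not put a space after the comma; in Python, `{m, n}` is treated as literal text rather than as a quantifier.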
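
Here is a brief sketch of these quantifiers on made-up strings:

```python
import re

numbers = "1 22 333 4444"

print(re.findall(r"colou?r", "color colour colouur"))  # the "u" is optional
print(re.findall(r"\b\d{3}\b", numbers))               # exactly three digits
print(re.findall(r"\b\d{2,3}\b", numbers))             # two or three digits
```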

The following example matches internet protocol (IP) addresses.

In [None]:
IPs = "1.0.0.255 1.32.232.0 1.32.239.2555 2.16.40.0 2.16.41.255".split(" ")

ptn = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"
[re.fullmatch(ptn, ip) for ip in IPs]

IP addresses consist of four numbers separated by periods, with each number ranging from 0 to 255. The pattern verifies that each number consists of between 1 and 3 digits and that each number is separated by a period. This pattern would not identify an invalid IP address that included numbers larger than 255, but it's a start. Note how we had to use a backslash to force the period to be treated as a literal character.
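
One way to close that gap is to let the regular expression check the format and let ordinary Python check the numeric range. The sketch below is only one possible approach, and the helper name `is_valid_ipv4` is hypothetical.

```python
import re

IP_PATTERN = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

def is_valid_ipv4(candidate):
    """Check the format with a regular expression, then check the numeric range."""
    if IP_PATTERN.fullmatch(candidate) is None:
        return False
    return all(0 <= int(part) <= 255 for part in candidate.split("."))

for ip in ["1.0.0.255", "1.32.239.2555", "2.16.300.1"]:
    print(ip, is_valid_ipv4(ip))
```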

The quantifiers `?`, `*`, and `+` each have a non-greedy version, `??`, `*?`, and `+?`. By default, `?`, `*`, and `+` try to match as large a string as possible.

In [None]:
txt = "<title>This is a Webpage Title</title>"

# Attempt to match only the HTML tag
print("Greedy Matching: ", re.search(r"<.+>", txt)[0])
print("Non-greedy Matching: ", re.search(r"<.+?>", txt)[0])

The first attempt to match the HTML tag matches the start tag, end tag, and the content of the tag, which is not what we want. This happens because the dot (`.`) matches every character in the string, including `>`, and the `+` quantifier tries to retrieve as large a match as possible. Adding the question mark (`?`) gives us a non-greedy pattern that stops at the first `>`.

### E. Anchors
Sometimes we only want a regular expression to match if the pattern occurs at the beginning or end of a string. Anchors are tokens that match specific locations within a string.
* The `^` token matches the beginning of a string (or line if MULTILINE flag is set).
* The `$` token matches the end of a string (or line if MULTILINE flag is set).
* The `\A` token matches the beginning of a string, but not the beginning of a line that occurs in the middle of a string.
* The `\Z` token matches the end of a string, but not the end of a line that occurs in the middle of the string.

Here's an example:

In [None]:
# Get all lines from robot poem that end with a quote.
re.findall(r'^.*"$', shel_poem, re.MULTILINE)

This example finds all lines from the poem that end in a quotation. Note that we had to pass the `re.MULTILINE` flag to the `re.findall()` function for the `^` and `$` tokens to match at the beginning and end of lines. The pattern also used the dot token and a quantifier token.
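
The sketch below contrasts `^` with `\A` and `\Z` on a short, made-up three-line string:

```python
import re

text = "one\ntwo\nthree"

print(re.findall(r"^\w+", text))                 # without the flag: start of the string only
print(re.findall(r"^\w+", text, re.MULTILINE))   # with the flag: start of every line
print(re.findall(r"\A\w+", text, re.MULTILINE))  # \A is not affected by MULTILINE
print(re.findall(r"\w+\Z", text, re.MULTILINE))  # \Z matches only at the very end
```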

In [None]:
print(re.M)

### F. Groups
We've already seen an example of groups. Groups are defined with parentheses and they instruct the regular expression engine to return the sub-string that matches the portion of the pattern within parentheses. Consider the following example.

In [None]:
print("15 character adverbs that start with B:\n", re.findall(r"\bB\w{12}LY\b", scrabble_words))
print()
print("Adverb roots:\n", re.findall(r"\b(B\w{12})LY\b", scrabble_words))

The first pattern identified 15-character adverbs that start with the letter B. The second pattern used a capture group to extract the root word from the adverbial form.

### G. Back-references
Back-references are tricky. It's best to start with an example. The following pattern finds all legal Scrabble words that are four characters long and that repeat the first two letters.

In [None]:
# Find all legal four letter scrabble words that repeat the first two letters.
matches = list(re.finditer(r"\b(\w)(\w)\1\2\b", scrabble_words))
print([m[0] for m in matches])

The tokens `\1` and `\2` are back-references. `\1` matches the same text that was matched by the first group in the pattern, and `\2` matches the same text that was matched by the second group. The end result is that the pattern matches four-character words where the 3rd letter is the same as the 1st and the 4th letter is the same as the 2nd. Back-references can range from `\1` to `\99`.
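
To watch the back-references work without loading the Scrabble file, here is the same pattern applied to a handful of made-up words (the names `words` and `repeats` are ours):

```python
import re

words = "MAMA PAPA DODO CATS NOON"

# NOON fails because its 3rd letter (O) differs from its 1st letter (N).
repeats = [m[0] for m in re.finditer(r"\b(\w)(\w)\1\2\b", words)]
print(repeats)
```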

We used a new regular expression function in this example because the `re.findall()` function doesn't work well in this instance. Here's what `re.findall()` would have returned:

In [None]:
print(re.findall(r"\b(\w)(\w)\1\2\b", scrabble_words))

If groups are defined in the pattern, `re.findall()` returns only the groups (as a list of tuples when there is more than one group). There is no way to use a back-reference without defining a group, so `re.findall()` doesn't work in this case because we want to extract the entire word. `re.finditer()` returns an iterator of `Match` objects. We have not covered Python iterators. For now, you just need to know that if an iterator is passed to the built-in `list()` function, the iterator is converted to a list. So calling `list(re.finditer(...))` returns a list of `Match` objects, or an empty list if no match is found. We used a list comprehension to extract the matched text from each `Match` object.

### H. Alternation
The alternation token is the vertical bar `|`, which is sometimes called a pipe character. The alternation token causes the pattern to match the portion of the pattern on the right, or the portion of the pattern on the left. The following example matches the word CAT or the word DOG.

In [None]:
print(re.findall(r"\bCAT\b|\bDOG\b", scrabble_words))

The alternation token is often used with parentheses. The purpose of the parentheses is not to define a group, but to limit the effect of the alternation token. Here is a pattern that finds all words with either CAT or DOG appearing somewhere in the middle of the word.

In [None]:
# All legal 7-letter Scrabble words that contain CAT or DOG in the middle of the word.
cat_dog_words = re.findall(r"\b\w+(?:CAT|DOG)\w+\b", scrabble_words)
print([word for word in cat_dog_words if len(word) == 7])

The pattern would not do what we want without the parentheses: the alternation would apply to everything on either side of the bar, so the pattern would match either `\b\w+CAT` or `DOG\w+\b`. With the parentheses, every match must have at least one word character before and after the sequence CAT or DOG. Starting the group with `?:` creates a non-capturing group, which is why `re.findall()` returns the entire word instead of multiple copies of DOG or CAT.
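
The effect of the parentheses is easier to see on a few hand-picked words. In this sketch the word list is made up, and the variable names are ours.

```python
import re

words = "CATALOG BOBCAT EDUCATE HOTDOGS DOGMA"

# Without parentheses the pattern means: (\b\w+CAT) OR (DOG\w+\b)
ungrouped = re.findall(r"\b\w+CAT|DOG\w+\b", words)
print(ungrouped)

# The non-capturing group limits the alternation to CAT or DOG
grouped = re.findall(r"\b\w+(?:CAT|DOG)\w+\b", words)
print(grouped)
```

The ungrouped pattern returns fragments and unwanted words, while the grouped pattern returns only whole words with CAT or DOG in the middle.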

### I. Flags
Parameters can be placed at the beginning of regular expression patterns. The parameters modify how the pattern is matched to the string. The parameters consist of a question mark followed by one or more letters in parentheses. These parameters are usually called flags or mode modifiers.

|Flag | Name | Description |
|-----|-------------------|-------------|
|`(?a)` | ASCII mode | Causes the character classes like `\w` and `\b` to only match ASCII characters|
|`(?i)` | IGNORECASE | Causes matches to be case insensitive. |
|`(?L)` | LOCALE     | Consider the computer's current locale setting for `\w`, `\W`, `\b`, and `\B` |
|`(?m)` | MULTILINE | Causes `^` and `$` to match the beginning and end of lines within a longer string |
|`(?s)` | DOTALL | Causes the dot `.` to match all characters, including newlines |
|`(?x)` | VERBOSE | Ignores whitespace and allows comments within patterns |

Different flags can be combined. For example, the pattern `r"(?ai)robot"` applies ASCII-only matching and ignores case.

In [None]:
# Find all lines that end with "me", one non-word character, and a closing quote
print(re.findall(r"(?m)^(.*me\W\"$)", shel_poem))

Flags can also be passed to most functions. The flags are passed in the form of special objects:
* `re.ASCII`, `re.A`
* `re.IGNORECASE`, `re.I`
* `re.LOCALE`, `re.L`
* `re.MULTILINE`, `re.M`
* `re.DOTALL`, `re.S`
* `re.VERBOSE`, `re.X`

There is no difference between using the short or long version. Flags can be combined with the bitwise OR operator `|`.

In [None]:
# Find all words that start with F, regardless of case.
# Combine different flags with bitwise OR
print(re.findall(r"\bf\w*\b", shel_poem, re.IGNORECASE | re.MULTILINE))

The VERBOSE flag is interesting. It allows white space and comments in regular expression patterns. For example, `verbose_ptn` is a legal regular expression pattern. The white space and comments have no effect on the matches.

In [None]:
verbose_ptn = r"""
    \b(\w)(\w)   # The first two characters of a word
    \1\2\b       # The 3rd and 4th character must match the 1st and 2nd.
"""
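
A verbose pattern only works if the `re.VERBOSE` flag (or `(?x)`) is supplied when the pattern is used. As a sketch, here is the IP address pattern from section II.D rewritten in verbose form and compared with the compact form; the variable names are our own.

```python
import re

verbose_ip = r"""
    \d{1,3} \.   # first number and a literal period
    \d{1,3} \.   # second number and a literal period
    \d{1,3} \.   # third number and a literal period
    \d{1,3}      # fourth number
"""
compact_ip = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"

for ip in ["2.16.40.0", "1.32.239.2555"]:
    print(ip,
          bool(re.fullmatch(verbose_ip, ip, re.VERBOSE)),
          bool(re.fullmatch(compact_ip, ip)))
```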

## III. Regular Expression Functions
The `re` module includes several different functions for matching patterns to text.

### A. `re.search()`
We've already seen examples of `re.search()`. It searches through the string starting at the beginning and returns a `Match` object for the first match found, or `None` if there is no match.

### B. `re.match()`
`re.match()` is similar to `re.search` but it only returns a match if the beginning of the string matches the pattern.

In [None]:
dad_joke = "I was reading a book on helium. I couldn’t put it down."

print(re.search("helium", dad_joke))
print(re.match("helium", dad_joke))
print(re.match("I was", dad_joke))

### C. `re.compile()`
Before we discuss what the `re.compile()` function does, we need to cover some background info. Before a regular expression pattern is used, it must be converted to a structure called a non-deterministic finite automaton (NFA). This conversion is called compiling. When you pass a string to a function like `re.search()` or `re.findall()`, the pattern is automatically compiled.
> **Note:** Actually, some applications (e.g. the `lex` and `flex` scanner generators) use deterministic finite automata (DFA) instead of NFAs, but that topic is *way* beyond the scope of this notebook. It's not necessary to understand NFAs to use regular expressions, but if you must know, you can Google it.

The `re.compile()` function takes a regular expression pattern and returns a compiled regular expression. You can pass the compiled regular expression to functions like `re.search()`, or you can call equivalent methods on the compiled regular expression.

In [None]:
# Compiling a regular expression to find all 2 and 3-letter
#   words that start with Q
compiled_pattern = re.compile(r"\bQ\w{1,2}\b")

# Call a method on the compiled expression
print(compiled_pattern.findall(scrabble_words))

# Pass the compiled expression to a function in `re`
re.search(compiled_pattern, scrabble_words)

The arguments are slightly different when calling `.search()` as a method on the compiled pattern than when using the module level `re.search()` function. The `re.search()` function accepts a *flags* argument (see section II.I) but the `.search()` method does not; for a compiled pattern, flags are passed to `re.compile()` instead. On the other hand, the `.search()` method accepts two optional, integer arguments, `pos` and `endpos`, that specify where in the string the method should start and stop searching for a match. There are similar differences in arguments for the other regular expression functions and methods, including `findall()`, `match()`, `finditer()`, and `fullmatch()`.
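
The following sketch shows `pos` and `endpos` in action on a made-up string. The names `text` and `robot_ptn` are ours.

```python
import re

text = "Robot one, robot two"
robot_ptn = re.compile(r"robot", re.IGNORECASE)  # flags are supplied at compile time

print(robot_ptn.search(text).span())     # search the whole string
print(robot_ptn.search(text, 1).span())  # pos=1: start searching at index 1
print(robot_ptn.search(text, 1, 12))     # endpos=12: stop searching before index 12
```

The last call prints `None` because the second "robot" does not fit between index 1 and index 12.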

Theoretically, compiling a regular expression pattern that you intend to use many times could speed up your program, because the expression would only be compiled once. In practice it makes little difference whether you pre-compile a regular expression, regardless of how many times you use it. When Python compiles a regular expression, it also caches the expression. If you use the same pattern again, Python will pull the compiled expression from the cache.

### D. `re.fullmatch()`
`re.fullmatch()` is the same as `re.match()`, except that the pattern must match the entire string.

In [None]:
stats_meme = "Statistics means never having to say you're certain."
print(re.fullmatch(r"S.*[^.]", stats_meme))
print(re.fullmatch(r"S.*\.", stats_meme))

### E. `re.findall()` and `re.finditer()`
We've already seen several examples of `re.findall()`, which returns a list of all matches that occur in a string. If the pattern includes a group, `re.findall()` returns a list of the group matches, or a list of tuples of matches if the pattern includes more than one group.

There are two significant differences between `re.finditer()` and `re.findall()`:
* `re.findall()` returns a list, but `re.finditer()` returns an iterator. Iterators can be used in many of the same places as lists, such as *for* loops and inside list comprehensions. They can also be converted to lists by passing them to the built-in `list()` function.
* The list returned by `re.findall()` contains strings, but `re.finditer()` provides `re.Match` objects. Consequently, `re.finditer()` can provide more information than `re.findall()`, including position of the match within the string, the full match, and all group matches.

Here is an example of the differences between `.findall()` and `.finditer()`.

In [None]:
# Use re.findall() to find all words in Silverstein poem that follow the word 'I'
ptn = re.compile(r"\bI (\w+)\b")
print(ptn.findall(shel_poem))

In [None]:
# Use re.finditer()
print(ptn.finditer(shel_poem))

The `.findall()` method provided a list of matches, as expected. But printing the iterator object from `.finditer()` did not provide useful output, because an iterator object is not the same as a list. We'll cover iterators in greater detail later. For now, just convert them to a list or use them in a list comprehension.

In [None]:
# Converting iterator to a list
matches = list(ptn.finditer(shel_poem))
matches

In [None]:
# Extract data from match objects
[(mtch.span(), mtch[0], mtch[1]) for mtch in matches]

Each match object contains two match results. `mtch[0]` returns the string that matches the *entire* pattern, and `mtch[1]` returns the string that matches the first matching group (part of the pattern in parentheses). If there were a second matching group, we would retrieve the matching text with `mtch[2]`, and so on.
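
As a sketch of a pattern with two groups, the example below uses a single line of the poem. The `.groups()` method and the group argument to `.span()` are part of the standard `Match` interface; the variable names are ours.

```python
import re

line = "I told my robot to do my biddin'"
m = re.search(r"to\s(\w+)\W(\w+)", line)

print(m[0])        # text matched by the whole pattern
print(m[1], m[2])  # text matched by the first and second groups
print(m.groups())  # all groups as a tuple
print(m.span(1))   # position of the first group within the string
```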

### F. `re.split()`
`re.split()` returns a list of strings created by splitting the string at every match.

In [None]:
# Splitting the Silverstein poem into substrings wherever there is whitespace.
ptn = re.compile(r"\s")
print(ptn.split(shel_poem))

### G. `re.sub()`
The `re.sub()` function is very useful. It replaces every match with a string that you provide.

In [None]:
print(re.sub(r"He", "She", shel_poem))
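
Because `re.sub()` replaces every match, a short literal pattern can change more text than intended. The sketch below uses a made-up sentence to show how word boundaries and the optional `count` argument narrow the replacement.

```python
import re

line = "He said hello. Here he comes. He left."

print(re.sub(r"He", "She", line))               # also changes "Here" to "Shere"
print(re.sub(r"\bHe\b", "She", line))           # word boundaries protect "Here"
print(re.sub(r"\bHe\b", "She", line, count=1))  # replace only the first match
```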

## IV. Regular Expression Exercises

**Ex IV.1.** Find all legal scrabble words that start and end with the letter 'U'.
* There are eight words that meet these criteria.
* Remember, regular expressions are case-sensitive by default.

In [None]:
# Ex IV.1



**Ex IV.2.** What four legal Scrabble words start and end with the same four letters, with the ending four letters in reverse order compared to the beginning four letters?
* This task uses back-references and capture groups, so `re.findall()` won't return the full words. Instead, generate a list of the four words by placing `re.finditer()` in a list comprehension.

In [None]:
# Ex IV.2



**Ex IV.3.** What is the mean molecular weight of all compounds listed in the text below?
* Use a regular expression to extract the molecular weights
* Use a list comprehension to convert the molecular weight strings to floating point values.
* There are a couple of built-in Python functions that will help you calculate the mean.

In [None]:
# Data for exercises IV.3 - 5
chem_txt = """
Compound name	Molecular weight	Molecular formula
1	Acetic acid	60.052 g/mol	CH3COOH
2	Hydrochloric acid	36.458 g/mol	HCl
3	Sulfuric acid	98.072 g/mol	H2SO4
4	Acetate	59.044 g/mol	CH3COO
5	Ammonia	17.031 g/mol	NH3
6	Nitric acid	63.012 g/mol	HNO3
7	Phosphoric acid	97.994 g/mol	H3PO4
8	Sodium phosphate	119.976 g/mol	Na3PO4
9	Calcium carbonate	100.086 g/mol	CaCO3
10	Ammonium sulfate	132.134 g/mol	(NH4)2SO4
11	Carbonic acid	62.024 g/mol	H2CO3
12	Sodium bicarbonate	84.0066 g/mol	NaHCO3
13	Sodium hydroxide	39.997 g/mol	NaOH
14	Calcium hydroxide	74.092 g/mol	Ca(OH)2
15	Ethanol	46.069 g/mol	C2H5OH
"""

In [None]:
# Ex IV.3



**Ex IV.4.** Try to write a regular expression that extracts all individual elements from the chemistry text in exercise IV.3.
* The word *try* is used here because it's not possible to extract all elements from the text using the `re` module from the Python Standard Library, but see how many you can get.
* Return the elements as a list. Pay close attention to which elements are missing. That might help you answer one of the quiz questions.

In [None]:
# Ex IV.4



**Ex IV.5.** Extract all of the *molecular formulas* from the chemistry text. It's challenging, but you should be able to extract all of these. Here are some hints.
* Within a molecular formula, how do you tell where a new element symbol starts?
* How do you match literal parentheses in a regular expression?
* You will need to use some non-capturing groups, which start with a question mark and colon just inside the opening parenthesis: `(?: ... )`. Non-capturing groups are useful when combined with the repetition tokens.
* Use the backslash to match literal parentheses that occur within the molecular formulas.
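
To illustrate the non-capturing group hint on unrelated data, here is a minimal sketch. The `log` string and the `addr_ptn` pattern are hypothetical and are not part of the exercise. The non-capturing group lets the `{3}` quantifier repeat a multi-character unit.

```python
import re

# Hypothetical sample text, unrelated to the chemistry data
log = "Server 192.168.0.12 replied; 10.0.0.1 did not."

# (?: ... ){3} repeats "one to three digits followed by a period" three times
# without creating a capture group.
addr_ptn = re.compile(r"(?:\d{1,3}\.){3}\d{1,3}")
addresses = addr_ptn.findall(log)
print(addresses)
```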

In [None]:
# Ex IV.5



**Ex IV.6.** Have you ever played Scrabble and gotten stuck with a 'q' near the end of the game, with no 'u' in sight? Find all of the words in the Scrabble dictionary that contain a 'q' but do not contain a 'u'. There are 85 such words.

In [None]:
# Ex IV.6



## V. Working with Text in Pandas
Many of the techniques we've already covered can be applied to Pandas data frames. To illustrate how we can manipulate text in data frames, we'll load a data frame that contains all 7,475 FIRST Robotics Competition teams.

### A. Load Data Frame

In [None]:
import pandas as pd

# Load data on all FRC teams
import pickle
with open("data/frc_teams.pickle", "rb") as pfile:
    teams = pickle.load(pfile)
    
print("Number of teams:", teams.shape[0])
print("First few rows of dataframe:")
teams.head()

### B. String Method Example
Suppose we would like to list all teams that have the word "penguin" in their name.

In [None]:
# Penguins anyone?
teams[teams.nickname.str.contains(r"penguin", case=False)]

Pandas provides about fifty different methods for manipulating text. [The entire list is here.](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#method-summary) The methods are accessible via the `.str` attribute of Pandas `Series` objects. I expect that sounds a bit confusing, so we'll review a few facts about data frames in the next section.

### C. Review of Series and Boolean Indexing
A Pandas `DataFrame` consists of one or more columns. The individual columns are Pandas `Series` objects. There are several ways to extract a single column as a `Series` object. We can place the column name as a string within square brackets. This always works.

In [None]:
# Extracting Series from DataFrame
teams["nickname"].head()

If the column name is a legal Python identifier (i.e., it could be used as a variable or function name), we can also extract the `Series` using attribute notation.

In [None]:
# Using attribute notation to extract Series from DataFrame
teams.nickname.head(8)

The `.str.contains()` method takes a regular expression pattern and attempts to match the pattern to every element of the `Series`. It returns a Boolean `Series` object with `True` and `False` values that indicate whether a match was found.

In [None]:
# Which rows contain "Robo"?
teams.nickname.str.contains("Robo").head(8)

If we place a Boolean `Series` within square brackets after a `DataFrame` variable, and the `Series` object is the same length as the `DataFrame`, the data frame will be filtered to contain only the rows that correspond to elements of the `Series` that are `True`.

In [None]:
teams[teams.nickname.str.contains("(?i)robo")].iloc[0:5, 0:7]
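
To make the two steps explicit, here is a minimal sketch on a made-up four-row data frame. The `demo` frame, its keys, and its nicknames are hypothetical and are not taken from the teams data. It also shows that Boolean `Series` objects can be combined with `&`, `|`, and `~` before filtering.

```python
import pandas as pd

# Hypothetical miniature data frame; keys and nicknames are made up
demo = pd.DataFrame({
    "key": ["t1", "t2", "t3", "t4"],
    "nickname": ["RoboCats", "Gear Heads", "robo lions", "Circuit Breakers"],
})

# Step 1: build a Boolean Series with one element per row
mask = demo.nickname.str.contains("(?i)robo")
print(mask.tolist())

# Step 2: use the Boolean Series to filter the rows of the data frame
print(demo[mask])

# Boolean Series can be combined: rows WITHOUT "robo" that end in "s"
others = demo[~mask & demo.nickname.str.contains("s$")]
print(others)
```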

### D. Simple String Functions

#### 1. Case Functions
The following methods modify a column's case. Note that they have the same names as the corresponding Python string methods.
* `.str.lower()`
* `.str.upper()`
* `.str.title()`
* `.str.capitalize()`
* `.str.swapcase()`
* `.str.casefold()`

In [None]:
# Convert column to upper case
print("Use upper() to convert a column to upper case:")
print(teams.nickname.str.upper().head())
print()

# Swap the case
print("Use swapcase() to swap upper and lower case letters:")
print(teams.nickname.str.swapcase().head())

#### 2. String Alignment
The following methods pad strings on the left, right, or both sides with a fill character, so that all strings are the same length. If not specified, the fill character is a space.

* `.str.ljust()`
* `.str.rjust()`
* `.str.center()`
* `.str.pad()`

In [None]:
# String alignment
print("Left Justified Strings: str.ljust()")
print(teams.city.str.ljust(20).head(5))

print("\nCentered Strings: str.center()")
print(teams.city.str.center(20, "-").head(5))

print("\nRight Justified Strings: str.rjust()")
print(teams.city.str.rjust(20, ".").head(5))
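
The `.str.pad()` method is not shown in the cell above. It covers all three cases through its `side` argument. The sketch below uses a small made-up `Series` (the `cities` values are only for illustration).

```python
import pandas as pd

# Hypothetical city names used only for illustration
cities = pd.Series(["Boise", "Issaquah", "Reno"])

# side may be "left", "right", or "both"; the fill goes on the named side(s)
left_padded = cities.str.pad(10, side="left", fillchar=".")
both_padded = cities.str.pad(10, side="both", fillchar="-")

print(left_padded.tolist())
print(both_padded.tolist())
```

Padding on the left gives the same result as `.str.rjust()`, and padding on both sides gives the same result as `.str.center()`.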

#### 3. String Testing
The following methods can be used to test the content of string columns.
* `.str.isalpha()`
* `.str.isnumeric()`
* `.str.isalnum()`
* `.str.isdigit()`
* `.str.isdecimal()`
* `.str.isspace()`
* `.str.islower()`
* `.str.isupper()`
* `.str.istitle()`

In [None]:
print("Number of teams with lower case names:", len(teams[teams.nickname.str.islower()]))
print("\nExample teams:")
teams[teams.nickname.str.islower()].iloc[0:5, 0:7]
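
To see how several of the testing methods behave side by side, here is a minimal sketch on a handful of made-up strings (the `samples` values are hypothetical). Each method returns a Boolean `Series`, just like `.str.contains()`.

```python
import pandas as pd

# Hypothetical strings chosen to exercise different tests
samples = pd.Series(["robots", "Robots", "2021", "team 254", " "])

lower = samples.str.islower().tolist()    # all cased characters are lower case
digits = samples.str.isdigit().tolist()   # every character is a digit
alnum = samples.str.isalnum().tolist()    # only letters and digits, no spaces
space = samples.str.isspace().tolist()    # only whitespace

print(lower)
print(digits)
print(alnum)
print(space)
```

Note that `"2021"` is not considered lower case because it has no cased characters, while `"team 254"` is.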

### E. Splitting and Joining Strings

The `.str.split()` method splits a string into multiple substrings. To show how this method can be helpful, we'll take a look at the *name* column in our teams data frame.

In [None]:
pd.set_option('display.max_colwidth', None)   # Keeps name column from being truncated
teams[["key", "nickname", "state_prov", "name"]].head()

For many teams, the name column is a long list of sponsors jammed together with slashes. Let's split the column on the forward slash character.

In [None]:
teams["split_name"] = teams.name.str.split("/")
teams[["key", "nickname", "state_prov", "split_name"]].head()

After the split, all of the string values are separated by commas and contained within square brackets. The brackets and commas are not just for formatting.

In [None]:
print("Value of split_name column for first row:\n\t", teams.split_name[0])
print("Value Type:", type(teams.split_name[0]))

The `.str.split()` function actually turns the contents of the column into a list of strings! We could also expand the values into separate columns of a data frame by setting the `expand` argument to `True`. For ease of display, we set the maximum number of splits to five.

In [None]:
# Create a dataframe with columns for each substring
teams.name.str.split("/", n=5, expand=True).head()

The opposite of `.str.split()` is `.str.join()`, which joins the elements of lists within a `Series` object into a single string. It works similarly to the string class's `.join()` method.
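
A minimal sketch of the round trip follows. The sponsor strings in `names` are made up for illustration and only imitate the slash-separated style of the *name* column.

```python
import pandas as pd

# Hypothetical slash-separated sponsor strings
names = pd.Series(["Acme Corp/City Schools/Robo Club", "Solo Sponsor"])

# split() turns each string into a list of strings ...
as_lists = names.str.split("/")

# ... and join() turns each list back into a single string,
# here using a different separator.
rejoined = as_lists.str.join("; ")
print(rejoined.tolist())
```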

Another useful function is `.str.cat()`, which can join strings from different columns. We could use `.str.cat()` to create a single column with the city, state, and country.

In [None]:
# The cat() method joins two or more text columns into one column.
print("Using the cat() method")
print(teams.city.str.cat([teams.state_prov, teams.country], sep=", ").head())

### F. Regular Expressions in Pandas
Several Pandas string methods accept regular expressions, including the `.str.contains()` method that we saw earlier. It returns a Boolean `Series` object based on whether the regular expression pattern is matched anywhere within the string. The following example finds the first ten teams whose name contains a four-letter word with a doubled vowel.

In [None]:
# Inline flags such as (?ix) must come at the very start of the pattern
# (Python 3.11 and later raise an error otherwise).
ptn = re.compile(r"""(?ix)  # Make the pattern case-insensitive and verbose
    \b\w          # 1st character can be any word character
    ([aeiou])     # 2nd character must be a vowel
    \1            # 3rd character must be the same as the 2nd
    \w\b          # 4th character can be any word character
""")

teams[teams.nickname.str.contains(ptn)].iloc[:10, :7]

The example shows that Pandas text methods will accept compiled and verbose patterns.

The `.str.match()` method is similar to `.str.contains()`, but it only returns `True` if the pattern matches at the beginning of the string. The `.str.fullmatch()` method requires the entire string to match. The `.str.extract()` method will extract matched groups into columns of a data frame.
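
The differences are easiest to see side by side. The sketch below uses a made-up three-row data frame (the `demo` keys and nicknames are hypothetical, not from the teams data) and assumes a Pandas version that provides `.str.fullmatch()` (1.1 or later).

```python
import pandas as pd

# Hypothetical keys and nicknames, made up for illustration
demo = pd.DataFrame({
    "key": ["t10", "t205", "t3000"],
    "nickname": ["Robo Lions", "The Robo Cats", "Robo"],
})

# contains(): the pattern may match anywhere in the string
anywhere = demo.nickname.str.contains(r"Robo").tolist()
# match(): the pattern must match at the beginning of the string
at_start = demo.nickname.str.match(r"Robo").tolist()
# fullmatch(): the pattern must match the entire string
entire = demo.nickname.str.fullmatch(r"Robo").tolist()
print(anywhere, at_start, entire)

# extract(): each capture group becomes a column;
# named groups supply the column names
parts = demo.key.str.extract(r"(?P<prefix>[a-z]+)(?P<number>\d+)")
print(parts)
```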

## VI. Pandas Text Exercises

**Ex VI.1.** Are cats or dogs more popular in FRC? Use a Pandas text function to extract all teams with "cat" in their name (use the *nickname* column) and all teams with "dog" in their name. Which list is longer?

In [None]:
# Ex VI.1



**Ex VI.2.** Find the five teams with the most sponsors. Display a data frame with the number of sponsors for these five teams, in descending order by number of sponsors. Also include the key, nickname, state_prov, and name columns.
* Assume the number of sponsors is the number of organizations in the *name* column, where the organizations are separated by slashes.
* Use a list comprehension to create a list with the number of sponsors for each team, and then add that list to the `teams` data frame as an additional column.
* The documentation for [Pandas `.sort_values()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html?highlight=sort_values#pandas.DataFrame.sort_values) might be useful.

In [None]:
# Ex VI.2



## VII. Quiz
Enter your answers as comments in the code cells below each question.

**#1.** What type of data does the pattern `r"(?i)\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b"` match? Convert the pattern to a verbose pattern, with a comment explaining what each section does.

In [None]:
#1


**#2.** What type of data does the pattern `r"(?i)<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>"` match? What does each capturing group capture? Convert the pattern to a verbose pattern, with a comment explaining what each section does.

In [None]:
#2


**#3.** Explain the difference between `re.match()`, `re.search()`, and `re.fullmatch()`.

In [None]:
#3


**#4.** What is the difference between the patterns `r"\b\w+(?:CAT|DOG)\w+\b"` and `r"\b\w+(CAT|DOG)\w+\b"`?

In [None]:
#4


**#5.** Suppose we wanted to search for a match within the second half of a string. How could we do that without using string slicing?

In [None]:
#5


**#6.** How can we retrieve the *position* of a match within a string?

In [None]:
#6


**#7.** In exercise IV.4, we noted that it's not possible to match every chemical element that occurs in the string with the Python Standard Library's `re` module. Why? 
> Hint: Go back to exercise IV.4 and pay attention to which elements are *not* matched. Then carefully read the documentation for the `re.findall()` function in the [official documentation for the `re` module](https://docs.python.org/3/library/re.html).

## VIII. Save Your Work
Once you have completed the exercises, save a copy of the notebook outside of the git repository (outside of the *pyclass_frc* folder). Include your name in the file name. Send the notebook file to another student to check your answers.

## IX. Summary and Review

### A. Summary
Regular expressions are a big topic, and there are several advanced techniques that were not covered in this notebook. Refer to this [excellent regular expression tutorial](https://www.regular-expressions.info/tutorial.html) if you would like to learn more. The tutorial describes several advanced features that are not available in the *Python Standard Library's* `re` module. There is a third-party regular expression package, called `regex`, that provides the advanced features and is available via Anaconda.

To use the `regex` package, first install the package from the command line:
```bash
conda install regex
```

Then import the package within your Python code:
```python
import regex as re
```

### B. Review
You should be able to define the following terms or describe the concept.
* Regular expression pattern
* Literal characters
* Special characters
* Pre-defined character classes
* User-defined character classes
* Anchors
* Capture groups
* Quantifiers
* Flags
* Back-references
* Alternation
* Regular expression functions
* Compiled regular expressions
* Verbose patterns
* Greedy vs. non-greedy matching
* Pandas string methods

[Table of Contents](../../index.ipynb)