# Working with Text, Part II

In the preceding session we studied string literals, escape sequences, string formatting, encoding, string slicing, and string methods. While a programmer can get a lot of work done with these techniques, they only go so far. If you wanted to search a block of text for chemical formulas or email addresses, you would probably need to write custom functions, and these custom functions would be difficult to debug.

In this session we'll study a text processing tool called *regular expressions*. Regular expressions can complete sophisticated text processing jobs, like identifying email addresses or chemical formulas, in just a few lines of code. It takes some effort to learn regular expressions, but they are available in most programming languages. Once you learn them you can use them anywhere!

Finally, we'll cover some of the text processing tools that are available in the Pandas package. It's pretty common to have a text column in a dataframe and to need to conduct some sort of transformation or evaluation of that column.

## I. Regular Expressions

### A. Our First Regular Expression
First, let's create a long string on which we can try out some regular expressions (RE). 

In [1]:
shel_poem = """
I told my robot to do my biddin'
He yawned and said, "You must be kiddin'."
I told my robot to cook me a stew.
He said, "I got better things to do."
I told my robot to sweep my shack.
He said, "You want me to strain my back?"
I told my robot to answer the phone.
He said, "I must make some calls of my own."
I told my robot to brew me some tea.
He said, "Why don’t you make tea for me?"
I told my robot to boil me an egg.
He said, "First -– lemme hear you beg."
I told my robot, "There’s a song you can play me."
He said, "How much are you gonna pay me?"
So I sold that robot, 'cause I never knew
Exactly who belonged to who.
"""

To use regular expressions in Python we must first import the *re* module from the standard library. We then create a pattern that specifies what we want to find, and finally we pass that pattern to a regular expression function.

In [7]:
# First regular expression example
import re

pattern = r"robot"

mtch = re.search(pattern, shel_poem)
mtch

<re.Match object; span=(11, 16), match='robot'>

In the preceding example, we created the regular expression pattern "robot" using a raw string. (We'll see why raw strings are useful later.) We then passed the pattern and a poem by Shel Silverstein to the `search()` function from Python's `re` module. The search function finds the first span of text in the poem that matches the pattern and returns the result as a `Match` object. The `Match` object tells us that the first occurrence of "robot" starts at position 11 and ends just before position 16. Note that positions are counted from 0, as with ordinary string indexing; the poem begins with a newline character at position 0, which is why the "r" in "robot" sits at position 11.

`Match` objects have several different methods and attributes, which are described in the [documentation for the `re` module](https://docs.python.org/3/library/re.html#match-objects). We can extract the position of the match with the `.span()` method, and we can extract the text that matched the pattern using list-style indexing.

In [20]:
mtch[0]

'robot'

You might be wondering why we would use list-style indexing to extract the text that was matched ("robot"). The answer is that a `Match` object can hold more than one piece of matched text: index 0 is always the entire match, and higher indices refer to parts of the match captured by groups, which will be explained later.
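As a minimal sketch of the position methods mentioned earlier, the following uses a short stand-in string (`text` is a hypothetical sample, not the full poem). It suggests that `.span()`, `.start()`, and `.end()` report zero-based offsets with the end offset excluded, so they can be used directly for slicing.

```python
import re

# Hypothetical sample: the first line of the poem, including the leading newline
text = "\nI told my robot to do my biddin'\n"

m = re.search(r"robot", text)

span = m.span()                   # (start, end) as a tuple
start, end = m.start(), m.end()   # the same two offsets, individually

# Offsets are zero-based and the end is exclusive, so slicing recovers the match
sliced = text[start:end]

print(span)    # (11, 16)
print(sliced)  # robot
```

Here `m[0]` and `text[start:end]` give the same text; the offsets are mainly useful when you need to know where in the string the match occurred.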

If no match is found, `re.search()` returns `None`.

In [25]:
type(re.search("Python", shel_poem))

NoneType
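Because `None` has no `[0]` index and no `.span()` method, code that uses the result of `re.search()` usually checks for `None` first. The sketch below shows one way to do that; the helper name `first_match_text` and the sample strings are assumptions made for illustration.

```python
import re

def first_match_text(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    m = re.search(pattern, text)
    if m is None:
        # No match: indexing m here would raise a TypeError
        return None
    return m[0]

found = first_match_text(r"robot", "I told my robot to cook me a stew.")
missing = first_match_text(r"Python", "I told my robot to cook me a stew.")

print(found)    # robot
print(missing)  # None
```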

### B. So What?
Right now, you might be thinking "So what?" We could have easily found the position of the word "robot" using the `.find()` method. Using the `.find()` method requires only one line of code, we don't have to import the `re` module, and we don't have to deal with a weird `Match` object. This is all true. If all you want to do is find the position of a string within another string, `.find()` is a good way to go.

In [22]:
# Using the .find() method
shel_poem.find("robot")

11

Still, we have not even scratched the surface of what regular expressions can do. To get a better understanding of their potential, we'll start by trying out another function from the `re` module. The `re.findall()` function finds ALL matches within a string.

In [28]:
# .findall() example
pattern = r"told"
re.findall(pattern, shel_poem)

['told', 'told', 'told', 'told', 'told', 'told', 'told']

The `re.findall()` function returns the matches as a list. Now you're probably thinking "OK, that's a little bit more interesting, but I'm still not impressed." We can see that the word "told" occurs seven times within the poem, but `re.findall()` does not provide the positions of the matches. Let's make it more interesting.

In [37]:
ptn = r"to\s(\w+)\W"
re.findall(ptn, shel_poem)

['do', 'cook', 'do', 'sweep', 'strain', 'answer', 'brew', 'boil', 'who']

What just happened? The pattern in this example returns every word in the poem that follows the word "to".
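To see how the pieces of that pattern contribute, here is a small sketch on a shortened sample (`sample` is a hypothetical two-line excerpt, not the full poem). It compares the pattern with and without the parentheses: when a pattern contains a single capturing group, `re.findall()` returns only the captured text rather than the whole match.

```python
import re

# Hypothetical sample: two lines borrowed from the poem
sample = 'I told my robot to cook me a stew.\nHe said, "I got better things to do."'

# to     the literal characters "to"
# \s     one whitespace character
# (\w+)  one or more word characters, captured as a group
# \W     one non-word character (a space, a period, and so on)
with_group = re.findall(r"to\s(\w+)\W", sample)

# Same pattern without the parentheses: findall returns the whole match
without_group = re.findall(r"to\s\w+\W", sample)

print(with_group)     # ['cook', 'do']
print(without_group)  # ['to cook ', 'to do.']
```

Note that "told" is not matched, because the "to" in "told" is followed by "l" rather than whitespace. The `to` in this pattern is not tied to a word boundary, however, so a word that merely ends in "to" and is followed by a space would also match.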