<a href="https://colab.research.google.com/github/google/applied-machine-learning-intensive/blob/master/content/xx_misc/regular_expressions/colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Copyright 2020 Google LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Introduction to Regular Expressions

[Regular Expressions](https://docs.python.org/3/howto/regex.html) are a powerful feature of the Python programming language. You can access Python's regular expression support through the [re](https://docs.python.org/3/library/re.html#module-re) module.

## Matching Literals

A regular expression is simply a string of text. The most basic regular expression is just a string containing only alphanumeric characters.

We use the `re.compile(...)` function to convert the regular expression string into a `Pattern` object.

In [None]:
import re

In [None]:
pattern = re.compile('Hello')
type(pattern)

Now that we have a compiled regular expression, we can see if the pattern matches another string.

In [None]:
if pattern.match('Hello World'):
  print("We found a match")
else:
  print("No match found")

In the case above we found a match because `'Hello World'` starts with `'Hello'`.

What happens if `'Hello'` is not at the start of a string?

In [None]:
if pattern.match('I said Hello World'):
  print("We found a match")
else:
  print("No match found")

So the match only works if the pattern matches the start of the other string. What if the case is different?

In [None]:
if pattern.match('HELLO'):
  print("We found a match")
else:
  print("No match found")

Doesn't work. By default, the match is case sensitive.
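
If case should not matter, the `re` module provides the `re.IGNORECASE` flag, which can be passed as the second argument to `re.compile`. The cell below is a minimal sketch; the name `insensitive_pattern` is only illustrative and is kept separate from `pattern` so the surrounding cells behave as before.

```python
import re

# re.IGNORECASE asks the engine to ignore letter case while matching.
insensitive_pattern = re.compile('Hello', re.IGNORECASE)

for string in ('Hello', 'HELLO', 'hello world', 'Help'):
  print("'{}'".format(string), end=' ')
  print('matches' if insensitive_pattern.match(string) else 'does not match')
```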

What if it is only a partial match?

In [None]:
if pattern.match('He'):
  print("We found a match")
else:
  print("No match found")

That fails too: the whole pattern must be found, so a string that contains only part of it is not a match.

From what we have seen so far, matching with a string literal is functionally equivalent to the `startswith(...)` method of Python's built-in `str` class.

In [None]:
if "Hello World".startswith("Hello"):
  print("We found a match")
else:
  print("No match found")

Well, that isn't too exciting. But it does provide us with an opportunity for a valuable lesson: *Regular expressions are often not the best solution for a problem.*

As we continue on in this colab, we'll see how powerful and expressive regular expressions can be. It is tempting to reach for a regular expression in many cases where it may not be the best solution. The regular expression engine can be slow for some types of expressions. Sometimes using other built-in tools or coding a solution in standard Python is better; sometimes it isn't.

## Repetition

Matching exact characters one-by-one is kind of boring and doesn't allow regular expressions to showcase their true power. Let's move on to some more dynamic parts of the regular expression language. We will begin with repetition.

### One or More

There are many cases where you'll need "one or more" of some character. To accomplish this, you simply add the `+` sign after the character that you want one or more of.

In the example below, we create an expression that looks for one or more 'b' characters. Notice how 'abc' and 'abbbbbbbc' are fine, but if we take all of the 'b' characters out, we don't get a match.

In [None]:
pattern = re.compile("ab+c")

for string in (
  'abc',
  'abbbbbbbc',
  'ac',
  ):
  print("'{}'".format(string), end=' ')
  print('matches' if pattern.match(string) else 'does not match')

### Zero or More

Sometimes we find ourselves in a situation where we're actually okay with "zero or more" instances of a character. For this we use the `*` sign.

In the example below we create an expression that looks for zero or more 'b' characters. In this case all of the matches are successful.

In [None]:
pattern = re.compile("ab*c")

for string in (
  'abc',
  'abbbbbbbc',
  'ac',
  ):
  print("'{}'".format(string), end=' ')
  print('matches' if pattern.match(string) else 'does not match')

### One or None

We've now seen cases where we will allow one-and-only-one of a character (exact match), one-or-more of a character, and zero-or-more of a character. The next case is the "one or none" case. For that we use the `?` sign.

In [None]:
pattern = re.compile("ab?c")

for string in (
  'abc',
  'abbbbbbbc',
  'ac',
  ):
  print("'{}'".format(string), end=' ')
  print('matches' if pattern.match(string) else 'does not match')

### Exactly M

What if you want to match a very specific number of a specific character, but you don't want to type all of those characters in? The `{m}` expression is great for that. The 'm' value specifies exactly how many repetitions you want.

In [None]:
pattern = re.compile("ab{7}c")

for string in (
  'abc', 
  'abbbbbbc', 
  'abbbbbbbc',
  'abbbbbbbbc',
  ):
  print("'{}'".format(string), end=' ')
  print('matches' if pattern.match(string) else 'does not match')

### M or More

You can also ask for m-or-more of a character. Leaving a dangling comma in the `{m,}` does the trick.

In [None]:
pattern = re.compile("ab{2,}c")

for string in (
  'abc',
  'abbc',
  'abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc',
  ):
  print("'{}'".format(string), end=' ')
  print('matches' if pattern.match(string) else 'does not match')

### M through N

You can also request a specific range of repetition using `{m,n}`. Notice that 'n' is *inclusive*. This is one of the rare times that you'll find ranges in Python that are inclusive at the end. Any ideas why? 

In [None]:
pattern = re.compile("ab{4,6}c")

for string in (
  'abbbc',
  'abbbbc',
  'abbbbbc',
  'abbbbbbc',
  'abbbbbbbc',
  ):
  print("'{}'".format(string), end=' ')
  print('matches' if pattern.match(string) else 'does not match')

### N or Fewer

Sometimes you want a specific number of repetitions or fewer. For this, you can use a comma before the 'n' parameter like `{,n}`. Notice that "fewer" includes zero instances of the character.

In [None]:
pattern = re.compile("ab{,4}c")

for string in (
  'abbbbbc', 
  'abbbbc', 
  'abbbc',
  'abbc',
  'abc',
  'ac',
  'a',
  ):
  print("'{}'".format(string), end=' ')
  print('matches' if pattern.match(string) else 'does not match')

Though we have illustrated these repetition operations on single characters, they actually apply to more complex combinations of characters, as we'll see soon.
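
As a small preview, here is a sketch of repetition applied to more than one character. It relies on parentheses, which are covered in the Grouping section; the name `group_pattern` is only illustrative.

```python
import re

# The + applies to the whole parenthesized sequence 'ab', so this matches
# 'ab' repeated one or more times, followed by 'c'.
group_pattern = re.compile('(ab)+c')

for string in ('abc', 'ababababc', 'abbc', 'c'):
  print("'{}'".format(string), end=' ')
  print('matches' if group_pattern.match(string) else 'does not match')
```

Note that `'abbc'` does not match here, because the repetition covers the pair `ab` rather than the single character `b`.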

## Character Sets

Matching a single character with repetition can be very useful, but often we want to work with more than one character. For that, the regular expressions need to have the concept of character sets. Character sets are contained within square brackets: `[]`

The character set below matches a single lower-case vowel. Because `match` begins at the start of the string, the pattern matches any string that starts with a vowel.

In [None]:
pattern = re.compile('[aeiou]')

for string in (
  'a',
  'e',
  'i',
  'o',
  'u',
  'x',
  'ax',
  'ex',
  'ix',
  'ox',
  'ux',
  'xa',
  'xe',
  'xi',
  'xo',
  'xu',
  'xx',
  ):
  print("'{}'".format(string), end=' ')
  print('matches' if pattern.match(string) else 'does not match')

Character sets can be bound to any of the repetition symbols that we have already seen. For example, if we wanted to match words that start with at least two vowels we could use the character set below.

In [None]:
pattern = re.compile('[aeiou]{2,}')

for string in (
  'aardvark',
  'earth',
  'eat',
  'oar',
  'aioli',
  'ute',
  'absolutely',
  ):
  print("'{}'".format(string), end=' ')
  print('matches' if pattern.match(string) else 'does not match')

Character sets can also be negated. Simply put a `^` symbol at the start of the character set.

In [None]:
pattern = re.compile('[^aeiou]')

for string in (
  'aardvark',
  'earth',
  'ice',
  'oar',
  'ukulele',
  'bathtub',
  ):
  print("'{}'".format(string), end=' ')
  print('matches' if pattern.match(string) else 'does not match')
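
Character sets also accept ranges written with a hyphen, such as `[a-z]` for lower-case ASCII letters or `[0-9]` for digits. The cell below is a short sketch of that syntax; the name `range_pattern` and the sample strings are only illustrative.

```python
import re

# [a-z] matches one lower-case ASCII letter; [0-9] matches one digit.
range_pattern = re.compile('[a-z][0-9]')

for string in ('a1', 'z9', 'A1', '1a', 'ab'):
  print("'{}'".format(string), end=' ')
  print('matches' if range_pattern.match(string) else 'does not match')
```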

## Character Classes

Some groupings of characters are so common that they have a shorthand "character class" assigned to them. Common character classes are represented by a backslash and a letter designating the class. For instance `\d` is the class for digits.

In [None]:
# The r prefix makes this a raw string so that Python leaves the backslash alone.
pattern = re.compile(r'\d')

for string in (
  'abc',
  '123',
  '1a2b',
  ):
  print("'{}'".format(string), end=' ')
  print('matches' if pattern.match(string) else 'does not match')

These classes can have repetitions after them, just like character sets.

In [None]:
pattern = re.compile(r'\d{4,}')

for string in (
  'a',
  '123',
  '1234',
  '12345',
  '1234a',
  ):
  print("'{}'".format(string), end=' ')
  print('matches' if pattern.match(string) else 'does not match')

There are many common character classes.

* `\d` matches digits
* `\s` matches spaces, tabs, etc.
* `\w` matches 'word' characters which include the letters of most languages, digits, and the underscore character

In [None]:
pattern = re.compile(r'\w\s\d')

for string in (
  'a',
  '1 3',
  '_ 4',
  'w 5',
  ):
  print("'{}'".format(string), end=' ')
  print('matches' if pattern.match(string) else 'does not match')
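
The statement that `\w` covers letters beyond the English alphabet can be checked directly. The cell below is a sketch that assumes the default behavior of Python 3 string patterns; the name `word_pattern` and the sample strings are illustrative, and the `r` prefix is explained in the Raw Strings section.

```python
import re

# \w+ matches a run of one or more 'word' characters.
word_pattern = re.compile(r'\w+')

for string in ('straße', 'λ_1', 'snake_case', '!?'):
  print("'{}'".format(string), end=' ')
  print('matches' if word_pattern.match(string) else 'does not match')
```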

You can mix these classes with repetitions.

In [None]:
pattern = re.compile(r'\d+\s\w+')

for string in (
  'a',
  '16 Candles',
  '47 Hats',
  'Number 5',
  ):
  print("'{}'".format(string), end=' ')
  print('matches' if pattern.match(string) else 'does not match')

But what if you want to find everything that isn't a digit? Or everything that isn't a space?

To do that, simply put the character class in upper-case.

In [None]:
print("Not a digit")
pattern = re.compile(r'\D')
for string in (
  'a',
  '1',
  ' ',
  ):
  print("'{}'".format(string), end=' ')
  print('matches' if pattern.match(string) else 'does not match')

print("\n")
print("Not a space")
pattern = re.compile(r'\S')
for string in (
  'a',
  '1',
  ' ',
  ):
  print("'{}'".format(string), end=' ')
  print('matches' if pattern.match(string) else 'does not match')

print("\n")
print("Not a word character")
pattern = re.compile(r'\W')
for string in (
  'a',
  '1',
  ' ',
  ):
  print("'{}'".format(string), end=' ')
  print('matches' if pattern.match(string) else 'does not match')

## Placement

We've moved into some pretty powerful stuff, but up until now all of our regular expressions have started matching from the first letter of a string. That is useful, but sometimes you'd like to match from anywhere in the string, or specifically at the end of the string. Let's explore some options for moving past the first character.

### The Dot

So far we have always had to have some character to match, but what if we don't care what character we encounter? The dot (`.`) is a placeholder for any character except, by default, a newline.

In [None]:
pattern = re.compile('.')

for string in (
  'a',
  ' ',
  '4',
  ):
  print("'{}'".format(string), end=' ')
  print('matches' if pattern.match(string) else 'does not match')

Though it might seem rather bland at first, the dot can be really useful when combined with repetition symbols.

In [None]:
pattern = re.compile('.*s')

for string in (
  'as',
  ' oh no bees',
  'does this match',
  'maybe',
  ):
  print("'{}'".format(string), end=' ')
  print('matches' if pattern.match(string) else 'does not match')

As you can see, using the dot allows us to move past the start of the string we want to match and instead search deeper inside the target string.
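
Prefixing a pattern with `.*` is one way to look deeper into a string. The `re` module also offers `search`, which scans the string for the first location where the pattern matches. The cell below is a brief sketch comparing the two approaches; the variable names are illustrative.

```python
import re

prefixed_pattern = re.compile('.*s')  # used with match()
plain_pattern = re.compile('s')       # used with search()

for string in ('as', ' oh no bees', 'maybe'):
  via_match = bool(prefixed_pattern.match(string))
  via_search = bool(plain_pattern.search(string))
  print("'{}'".format(string), 'match:', via_match, 'search:', via_search)
```

For these single-line strings both approaches agree. Using `search` keeps the pattern shorter and avoids the extra `.*` work.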

### Starting Anchor

Now we can search anywhere in a string. However, we might still want to pin part of a pattern to the beginning of the string. The `^` symbol anchors the match to the start of the string.

In [None]:
pattern = re.compile('^a.*s')

for string in (
  'as',
  'not as',
  'a string that matches',
  'a fancy string that matches',
  ):
  print("'{}'".format(string), end=' ')
  print('matches' if pattern.match(string) else 'does not match')
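
Because `match` already begins at the start of the string, the effect of `^` is easier to see with `search`, which otherwise looks for the pattern anywhere in the string. The cell below is a sketch with illustrative variable names.

```python
import re

anchored = re.compile('^a.*s')
unanchored = re.compile('a.*s')

for string in ('as', 'not as'):
  print("'{}'".format(string), end=' ')
  print('anchored:', bool(anchored.search(string)), end=' ')
  print('unanchored:', bool(unanchored.search(string)))
```

Only the unanchored pattern finds `'as'` inside `'not as'`.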

### Ending Anchor

We can anchor to the end of a string with the `$` symbol.

In [None]:
pattern = re.compile('.*s$')

for string in (
  'as',
  'beees',
  'sa',
  ):
  print("'{}'".format(string), end=' ')
  print('matches' if pattern.match(string) else 'does not match')
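
To see what the `$` contributes, it can help to compare the same pattern with and without it. The cell below is a small sketch; the variable names are illustrative.

```python
import re

with_anchor = re.compile('.*s$')
without_anchor = re.compile('.*s')

for string in ('as', 'sa'):
  print("'{}'".format(string), end=' ')
  print('with $:', bool(with_anchor.match(string)), end=' ')
  print('without $:', bool(without_anchor.match(string)))
```

Without the anchor, `'sa'` matches because the pattern is satisfied by the leading `'s'` alone. With the anchor, the `'s'` must be the last character.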


## Grouping

We have searched for exact patterns in our data, but sometimes we want *either* one thing *or* another. We can group searches with parentheses and match only one item in a group.

In [None]:
pattern = re.compile('.*(cat|dog)')

for string in (
  'cat',
  'dog',
  'fat cat',
  'lazy dog',
  'hog',
  ):
  print("'{}'".format(string), end=' ')
  print('matches' if pattern.match(string) else 'does not match')

Grouping can also be done on a single item.

In [None]:
pattern = re.compile('.*(dog)')

for string in (
  'cat',
  'dog',
  'fat cat',
  'lazy dog',
  'hog',
  ):
  print("'{}'".format(string), end=' ')
  print('matches' if pattern.match(string) else 'does not match')

But why would you ever group a single item? It turns out that grouping is 'capture grouping' by default and allows you to extract items from a string.

In [None]:
pattern = re.compile('.*(dog)')

match = pattern.match("hot diggity dog")

if match:
  print(match.group(0))
  print(match.group(1))

In the case above, group 0 is the entire text matched by the expression, which here is the whole string. The string 'dog' is group 1 because it was 'captured' by the parentheses.

You can have more than one capture group:

In [None]:
pattern = re.compile('.*(dog).*(cat)')

match = pattern.match("hot diggity dog barked at a scared cat")

if match:
  print(match.group(0))
  print(match.group(1))
  print(match.group(2))

And capture groups can contain alternatives:

In [None]:
pattern = re.compile('.*(dog).*(mouse|cat)')

match = pattern.match("hot diggity dog barked at a scared cat")

if match:
  print(match.group(0))
  print(match.group(1))
  print(match.group(2))

Grouping can get even richer. For example:

- What happens when you have a group within another group?

- Can a group be repeated?

These are more intermediate-to-advanced applications of regular expressions that you might want to explore on your own.
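
As a starting point for that exploration, the cell below sketches both questions. The patterns and variable names are illustrative only.

```python
import re

# Nested groups are numbered by the position of their opening parenthesis.
nested = re.compile('((hot) (dog))')
nested_match = nested.match('hot dog stand')
print(nested_match.group(1))  # the outer group
print(nested_match.group(2))  # the first inner group
print(nested_match.group(3))  # the second inner group

# A repeated group keeps only the text from its last repetition.
repeated = re.compile('(cat|dog)+')
repeated_match = repeated.match('catdog')
print(repeated_match.group(0))
print(repeated_match.group(1))
```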

## Substitution

So far we have been concerned with finding patterns in a string. Locating things is great, but sometimes you want to take action. A common action is substitution.

Say that I want to replace every instance of 'cat' or 'mouse' in a string with 'whale'. To do that I can compile a pattern that looks for 'cat' or 'mouse' and pass that pattern to the `re.sub` function.

In [None]:
pattern = re.compile('(cat|mouse)')

re.sub(pattern, 'whale', 'The dog is afraid of the mouse')

So far, we have compiled all of our regular expressions before using them. It turns out that many of the functions in the `re` module can accept a pattern string and will compile that string for you.

You might see something like the code below in practice:

In [None]:
re.sub('(cat|mouse)', 'whale', 'The dog is afraid of the mouse')

`sub` is compiling the string "(cat|mouse)" into a pattern and then applying it to the input string.
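
A compiled `Pattern` object also has its own `sub` method, and the replacement text may refer to captured groups with `\1`, `\2`, and so on. The cell below is a short sketch that reuses the sentence from above; the variable names are illustrative, and the `r` prefix is explained in the Raw Strings section.

```python
import re

animal_pattern = re.compile(r'(cat|mouse)')
sentence = 'The dog is afraid of the mouse'

# Equivalent to re.sub(animal_pattern, 'whale', sentence).
print(animal_pattern.sub('whale', sentence))

# \1 in the replacement inserts whatever group 1 captured.
print(animal_pattern.sub(r'big \1', sentence))
```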

## Raw Strings

While working with Python code that uses regular expressions, you might occasionally encounter a string that looks like `r'my string'` instead of the `'my string'` that you are accustomed to seeing.

The `r` designation means that the string is a *raw* string. Let's look at some examples to see what this means.

In [None]:
print('\tHello')
print(r'\tHello')
print('\\')
print(r'\\')

You'll notice that the regular string containing `\t` printed a tab character. The raw string printed a literal `\t`. Likewise the regular string printed `\` while the raw string printed `\\`.

When Python reads a regular string literal, it translates escape sequences like `\t` (tab), `\n` (newline), `\\` (backslash), and others into the single characters they represent.

Raw strings turn off that translation. This is useful for regular expressions because the backslash is a common character in regular expressions. Translating backslashes to other characters would break the expression.

Should you always use a raw string when creating a regular expression? Probably. Even if it isn't necessary now, the expression might grow over time, and it is helpful to have it in place as a safeguard.
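
The cell below sketches one case where the raw prefix changes the result. It uses `\b`, the regular expression word-boundary marker, which is not covered elsewhere in this colab and serves here only as an illustration. In a regular Python string, `'\b'` is translated into a backspace character before the regular expression engine ever sees it.

```python
import re

# Both literals describe the same two-character string: a backslash and 'd'.
print(r'\d' == '\\d')         # True
print(len(r'\n'), len('\n'))  # 2 1

# Raw string: the engine receives \b and treats it as a word boundary.
raw_result = re.search(r'\bcat\b', 'a cat sat')
# Regular string: the engine receives backspace characters instead.
plain_result = re.search('\bcat\b', 'a cat sat')

print(raw_result is not None)
print(plain_result is not None)
```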

# Exercises

## Exercise 1: Starts With 'a'

Create a regular expression pattern object that matches strings starting with the lower-case letter 'a'. Apply it to the test data provided. Loop over each string of test data and print "match" or "no match" as a result of your expression.

### **Student Solution**

In [None]:
test_data = [
  'apple',
  'banana',
  'grapefruit',
  'apricot',
  'orange'
]

# Create a pattern here

for test in test_data:
  pass # Your pattern match goes here 

---

### Answer Key

In [None]:
test_data = [
  'apple',
  'banana',
  'grapefruit',
  'apricot',
  'orange'
]

pattern = re.compile(r'a.*')

for test in test_data:
  print("'{}'".format(test), end=' ')
  print('matches' if pattern.match(test) else 'does not match')

---

## Exercise 2: Contains 'zoo' or 'ZOO'


Create a regular expression pattern object that matches strings containing 'zoo' or 'ZOO'. Apply it to the test data provided. Loop over each string of the test data and print "match" or "no match" as a result of your expression.

### **Student Solution**

In [None]:
test_data = [
  'zoo',
  'ZOO',
  'bazooka',
  'ZOOLANDER',
  'kaZoo',
  'ZooTopia',
  'ZOOT Suit',
]

# Create a pattern here

for test in test_data:
  pass # Your pattern match goes here 

---

### Answer Key

In [None]:
test_data = [
  'zoo',
  'ZOO',
  'bazooka',
  'ZOOLANDER',
  'kaZoo',
  'ZooTopia',
  'ZOOT Suit',
]

print('Using pattern.match')
pattern = re.compile(r'.*(zoo|ZOO).*')

for test in test_data:
  print("'{}'".format(test), end=' ')
  print('matches' if pattern.match(test) else 'does not match') 

print('\n')
print('Using pattern.search')
pattern = re.compile(r'zoo|ZOO')

for test in test_data:
  print("'{}'".format(test), end=' ')
  print('matches' if pattern.search(test) else 'does not match') 

---

## Exercise 3: Endings

Create a regular expression pattern object that finds words that end with 'ing', independent of case. Apply it to the test data provided. Loop over each string of the test data and print "match" or "no match" as a result of your expression.

### **Student Solution**

In [None]:
test_data = [
  'sing',
  'talking',
  'SCREAMING',
  'NeVeReNdInG',
  'ingeron',
]

# Create a pattern here

for test in test_data:
  pass # Your pattern match goes here 

---

### Answer Key

In [None]:
test_data = [
  'sing',
  'talking',
  'SCREAMING',
  'NeVeReNdInG',
  'ingeron',
]

# re.IGNORECASE makes the pattern case insensitive.
# $ anchors to the end of the string.
pattern = re.compile(r'.*ing$', re.IGNORECASE)

for test in test_data:
  print("'{}'".format(test), end=' ')
  print('matches' if pattern.match(test) else 'does not match') 

---