# Chapter 9 - Text Pattern Matching with Regex

## Notes

### A REVIEW OF REGEX SYMBOLS

This chapter has covered a lot of notation so far, so here’s a quick review of what you’ve learned about basic regular expression syntax:  

The ? matches zero or one instance of the preceding group.  
The * matches zero or more instances of the preceding group.  
The + matches one or more instances of the preceding group.  
The {n} matches exactly n instances of the preceding group.  
The {n,} matches n or more instances of the preceding group.  
The {,m} matches 0 to m instances of the preceding group.  
The {n,m} matches at least n and at most m instances of the preceding group.  
{n,m}? or *? or +? performs a non-greedy match of the preceding group.  
^spam means the string must begin with spam.  
spam$ means the string must end with spam.  
The . matches any character, except newline characters.  
The \d, \w, and \s match a digit, word, or space character, respectively.  
The \D, \W, and \S match anything except a digit, word, or space character, respectively.  
[abc] matches any character between the square brackets (such as a, b, or c).  
[^abc] matches any character that isn’t between the square brackets.  
(Hello) groups 'Hello' together as a single group.  
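
A minimal sketch of a few of these symbols in action, using only the standard `re` module. The sample strings and variable names are made up for illustration.

```python
import re

# ? makes the preceding group optional, so both spellings match.
bat_re = re.compile(r'Bat(wo)?man')
optional_hits = [bat_re.search(s).group() for s in ('Batman', 'Batwoman')]

# {3,5} is greedy by default; {3,5}? matches as few repeats as it can.
greedy = re.search(r'(Ha){3,5}', 'HaHaHaHaHa').group()
lazy = re.search(r'(Ha){3,5}?', 'HaHaHaHaHa').group()

# ^ and $ together require the whole string to fit the pattern.
all_digits = re.compile(r'^\d+$')
whole_ok = all_digits.search('1234567890') is not None
whole_bad = all_digits.search('12345xyz67') is not None

# [^...] matches any character that is NOT listed in the brackets.
consonants = re.findall(r'[^aeiou\s]', 'regex fun')

print(optional_hits)        # ['Batman', 'Batwoman']
print(greedy, lazy)         # HaHaHaHaHa HaHaHa
print(whole_ok, whole_bad)  # True False
print(consonants)           # ['r', 'g', 'x', 'f', 'n']
```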

### Project 3: Extract Contact Information from Large Documents

Say you’ve been given the boring task of finding every phone number and email address in a long web page or document. If you manually scroll through the page, you might end up searching for a long time. But if you had a program that could search the text in your clipboard for phone numbers and email addresses, you could simply press CTRL-A to select all the text, press CTRL-C to copy it to the clipboard, and then run your program. It could replace the text on the clipboard with just the phone numbers and email addresses it finds.  

Whenever you’re tackling a new project, it can be tempting to dive right into writing code. But more often than not, it’s best to take a step back and consider the bigger picture. I recommend first drawing up a high-level plan for what your program needs to do. Don’t think about the actual code yet; you can worry about that later. Right now, stick to broad strokes.  

For example, your phone number and email address extractor will need to do the following:  

1. Get the text from the clipboard.
2. Find all phone numbers and email addresses in the text.
3. Paste them onto the clipboard.

Now you can start thinking about how this might work in code. The code will need to do the following:  

1. Use the pyperclip module to copy and paste strings.
2. Create two regexes, one for matching phone numbers and one for matching email addresses.
3. Find all matches (not just the first match) of both regexes.
4. Neatly format the matched strings into a single string to paste.
5. Display some kind of message if no matches were found in the text.

This list is like a road map for the project. As you write the code, you can focus on each of these steps separately, and each step should seem fairly manageable. They’re also expressed in terms of things you already know how to do in Python.

#### Step 1: Create a Regex for Phone Numbers
First, you have to create a regular expression to search for phone numbers. Create a new file, enter the following, and save it as phoneAndEmail.py:

In [None]:
# import pyperclip, re

# phone_re = re.compile(r'''(
#     (\d{3}|\(\d{3}\))?  # Area code
#     (\s|-|\.)?  # Separator
#     (\d{3})  # First three digits
#     (\s|-|\.)  # Separator
#     (\d{4})  # Last four digits
#     (\s*(ext|x|ext\.)\s*(\d{2,5}))?  # Extension
#     )''', re.VERBOSE)

# TODO: Create email regex.

# TODO: Find matches in clipboard text.

# TODO: Copy results to the clipboard.

The TODO comments are just a skeleton for the program. They’ll be replaced as you write the actual code.

The phone number begins with an optional area code, so we follow the area code group with a question mark. Since the area code can be just three digits (that is, `\d{3}`) or three digits within parentheses (that is, `\(\d{3}\)`), you should have a pipe joining those parts. You can add the regex comment # Area code to this part of the multiline string to help you remember what `(\d{3}|\(\d{3}\))?` is supposed to match.  

The phone number separator character can be an optional space (\s), hyphen (-), or period (.), so we should also join these parts using pipes. The next few parts of the regular expression are straightforward: three digits, followed by another separator, followed by four digits. The last part is an optional extension made up of any number of spaces followed by ext, x, or ext., followed by two to five digits.  
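
To see how the pieces line up, the sketch below runs the same pattern against a few sample numbers and prints selected groups. The sample strings are assumptions chosen for illustration, not taken from the chapter.

```python
import re

phone_re = re.compile(r'''(
    (\d{3}|\(\d{3}\))?  # Area code
    (\s|-|\.)?  # Separator
    (\d{3})  # First three digits
    (\s|-|\.)  # Separator
    (\d{4})  # Last four digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?  # Extension
    )''', re.VERBOSE)

samples = ['415-555-1234', '(415) 555-1234', '555.1234 ext 99', '5551234']

for sample in samples:
    match = phone_re.search(sample)
    if match is None:
        print(f'{sample!r}: no match')
    else:
        # Group 1 is the outer parentheses, so the area code is group 2,
        # the two digit runs are groups 4 and 6, and the extension
        # digits are group 9.
        print(f'{sample!r}: area={match.group(2)!r}, '
              f'number={match.group(4)}-{match.group(6)}, '
              f'ext={match.group(9)!r}')
```

The last sample finds no match because the separator between the three-digit and four-digit parts is required, while the area code and extension are optional and come back as `None` when absent.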

#### Step 2: Create a Regex for Email Addresses
You will also need a regular expression that can match email addresses. Make your program look like the following:

In [None]:
# import pyperclip, re

# phone_re = re.compile(r'''(
# --snip--

# # Create email regex.
# email_re = re.compile(r'''(
#   ❶ [a-zA-Z0-9._%+-]+  # Username
#   ❷ @  # @ symbol
#   ❸ [a-zA-Z0-9.-]+  # Domain name
#     (\.[a-zA-Z]{2,4})  # Dot-something
#     )''', re.VERBOSE)

# TODO: Find matches in clipboard text.

# TODO: Copy results to the clipboard.

The username part of the email address ❶ consists of one or more characters that can be any of the following: lowercase and uppercase letters, numbers, a dot, an underscore, a percent sign, a plus sign, or a hyphen. You can put all of these into a character class: [a-zA-Z0-9._%+-].

The domain and username are separated by an @ symbol ❷. The domain name ❸ has a slightly less permissive character class, with only letters, numbers, periods, and hyphens: [a-zA-Z0-9.-]. Last is the “dot-com” part (technically known as the top-level domain), which can really be dot-anything.

The format for email addresses has a lot of weird rules. This regular expression won’t match every possible valid email address, but it will match almost any typical email address you’ll encounter.
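
As a rough illustration of both the typical case and one limitation, the sketch below applies the same pattern to made-up text. The addresses are assumptions for the example only.

```python
import re

email_re = re.compile(r'''(
    [a-zA-Z0-9._%+-]+  # Username
    @  # @ symbol
    [a-zA-Z0-9.-]+  # Domain name
    (\.[a-zA-Z]{2,4})  # Dot-something
    )''', re.VERBOSE)

text = 'Write to info@example.com or first.last+tag@mail.example.org today.'

# Each findall() result is a tuple: (whole address, dot-something part).
found = [groups[0] for groups in email_re.findall(text)]
print(found)  # ['info@example.com', 'first.last+tag@mail.example.org']

# A dot-something part longer than four letters is only partially matched,
# because {2,4} stops after four letters.
partial = email_re.search('curator@example.museum').group()
print(partial)  # curator@example.muse
```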

#### Step 3: Find All Matches in the Clipboard Text
Now that you’ve specified the regular expressions for phone numbers and email addresses, you can let Python’s re module do the hard work of finding all the matches on the clipboard. The pyperclip.paste() function will get a string value of the text on the clipboard, and the findall() regex method will return a list of tuples.  

Make your program look like the following:

In [None]:
# import pyperclip, re

# phone_re = re.compile(r'''(
# --snip--

# Find matches in clipboard text.
# text = str(pyperclip.paste())

# ❶ matches = []
# ❷ for groups in phone_re.findall(text):
#     phone_num = '-'.join([groups[1], groups[3], groups[5]])
#     if groups[8] != '':
#         phone_num += ' x' + groups[8]
#     matches.append(phone_num)
# ❸ for groups in email_re.findall(text):
#     matches.append(groups[0])

# TODO: Copy results to the clipboard. 

There is one tuple for each match, and each tuple contains strings for each group in the regular expression. Because the outermost parentheses wrap the entire pattern, the string at index 0 of each tuple is the complete match.  

As you can see at ❶, you’ll store the matches in a list variable named matches. It starts off as an empty list, and two for loops fill it. For the email addresses, you append index 0 of each tuple ❸. For the matched phone numbers, you don’t want to just append index 0. While the program detects phone numbers in several formats, you want the phone number appended to be in a single, standard format. The phone_num variable contains a string built from items 1, 3, 5, and 8 of each tuple ❷. (These are the area code, first three digits, last four digits, and extension digits. Index 6 holds the whole extension, including the ext or x prefix, so it isn’t the one to append.)
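
The tuple layout is easier to follow with a concrete result in front of you. This sketch prints what `findall()` returns for the phone pattern and then normalizes each tuple; the helper name `normalize` and the sample text are assumptions for illustration.

```python
import re

phone_re = re.compile(r'''(
    (\d{3}|\(\d{3}\))?  # Area code
    (\s|-|\.)?  # Separator
    (\d{3})  # First three digits
    (\s|-|\.)  # Separator
    (\d{4})  # Last four digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?  # Extension
    )''', re.VERBOSE)

def normalize(groups):
    """Build one standard-format string from a findall() tuple."""
    phone_num = '-'.join([groups[1], groups[3], groups[5]])
    # Index 8 holds only the extension digits; groups that did not
    # take part in the match come back as empty strings.
    if groups[8] != '':
        phone_num += ' x' + groups[8]
    return phone_num

text = 'Call 415.555.1234 x77 or (800) 555-0199.'
tuples = phone_re.findall(text)
print(tuples[0])
# ('415.555.1234 x77', '415', '.', '555', '.', '1234', ' x77', 'x', '77')

matches = [normalize(groups) for groups in tuples]
print(matches)  # ['415-555-1234 x77', '(800)-555-0199']
```

Note that the parentheses around an area code are kept as matched, and a number without an area code would come out with a leading hyphen, since `groups[1]` is then an empty string.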

#### Step 4: Join the Matches into a String
Now that you have the email addresses and phone numbers as a list of strings in matches, you want to put them on the clipboard. The pyperclip.copy() function takes only a single string value, not a list of strings, so you must call the join() method on matches.

Make your program look like the following:

In [None]:
# import pyperclip, re

# phone_re = re.compile(r'''(
# --snip--
# for groups in email_re.findall(text):
#     matches.append(groups[0])

# # Copy results to the clipboard.
# if len(matches) > 0:
#     pyperclip.copy('\n'.join(matches))
#     print('Copied to clipboard:')
#     print('\n'.join(matches))
# else:
#     print('No phone numbers or email addresses found.')

To make it easier to see that the program is working, we also print any matches you find to the terminal window. If no phone numbers or email addresses were found, the program tells the user this.
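
The join-and-report step can be tried without touching the clipboard. The sketch below pulls that logic into a small function; `build_report` is a hypothetical helper name, not part of the chapter's program.

```python
def build_report(matches):
    """Return the message the script would print for a list of matches."""
    if len(matches) > 0:
        # join() places a newline between items, not after the last one.
        return 'Copied to clipboard:\n' + '\n'.join(matches)
    return 'No phone numbers or email addresses found.'

print(build_report(['415-555-9900', 'info@nostarch.com']))
print(build_report([]))
```

Keeping the formatting separate from `pyperclip.copy()` makes it possible to check the output on a machine where no clipboard is available.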

To test your program, open your web browser to the No Starch Press contact page at https://nostarch.com/contactus, press CTRL-A to select all the text on the page, and press CTRL-C to copy it to the clipboard. When you run this program, the output should look something like this:

```
Copied to clipboard:
800-555-7240
415-555-9900
415-555-9950
info@nostarch.com
media@nostarch.com
academic@nostarch.com
info@nostarch.com
```

You can modify this script to search for mailing addresses, social media handles, and many other types of text patterns.

## Practice Questions

1. What is the function that returns Regex objects?  
    **Answer:** `re.compile()`

2. Why are raw strings often used when creating Regex objects?  
    **Answer:** Raw strings let you type backslashes as-is, so regex escapes such as `\d` don’t have to be written as `\\d`.

3. What does the search() method return?  
    **Answer:** It returns a Match object for the first match in the string, or `None` if the pattern is not found.

4. How do you get the actual strings that match the pattern from a Match object?  
    **Answer:** match.group()

5. In the regex created from r'(\d\d\d)-(\d\d\d-\d\d\d\d)', what does group 0 cover? Group 1? Group 2?  
    **Answer:** Group 0 is the entire match, group 1 covers the first set of parentheses (the area code), and group 2 covers the second set (the rest of the phone number).

6. Parentheses and periods have specific meanings in regular expression syntax. How would you specify that you want a regex to match actual parentheses and period characters?  
    **Answer:** With a backslash character (escape character).

7. The `findall()` method returns a list of strings or a list of tuples of strings. What makes it return one or the other?  
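    **Answer:** It depends on the number of groups in the regex. With no groups, `findall()` returns a list of strings, one per full match. With exactly one group, it returns a list of strings holding that group’s text. With two or more groups, it returns a list of tuples of strings, one tuple per match.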

8. What does the | character signify in regular expressions?  
    **Answer:** Alternation: it matches either the pattern on its left or the pattern on its right.

9. What two things does the ? character signify in regular expressions?  
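    **Answer:** Either that the preceding group is optional (zero or one instance, like `{0,1}`), or, when it follows a quantifier such as `*`, `+`, or `{n,m}`, that the match should be non-greedy.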

10. What is the difference between the + and * characters in regular expressions?  
    **Answer:** `+` matches one or more instances of the preceding group (`{1,}`), while `*` matches zero or more (`{0,}`).

11. What is the difference between {3} and {3,5} in regular expressions?  
    **Answer:** `{3}` matches exactly three instances of the preceding group; `{3,5}` matches between three and five instances.

12. What do the \d, \w, and \s shorthand character classes signify in regular expressions?  
    **Answer:** `\d = [0-9]`, `\w = Any letter, numeric digit, or the underscore character. (Think of this as matching “word” characters.)`, `\s = Any space, tab, or newline character. (Think of this as matching “space” characters.)`

13. What do the \D, \W, and \S shorthand character classes signify in regular expressions?  
    **Answer:** They are the opposites of `\d`, `\w`, and `\s`: each matches any single character that is not a digit, word, or space character, respectively.

14. What is the difference between the .* and .*? regular expressions?  
    **Answer:** `.*` is greedy and `.*?` is non-greedy. A greedy match takes the longest possible string, while a non-greedy match takes the shortest.

15. What is the character class syntax to match all numbers and lowercase letters?  
    **Answer:** `[0-9a-z]` (or `[a-z0-9]`)

16. How do you make a regular expression case-insensitive?  
    **Answer:** Pass `re.IGNORECASE` as the second argument to `re.compile()`.

17. What does the . character normally match? What does it match if re.DOTALL is passed as the second argument to re.compile()?  
    **Answer:** It normally matches everything except a newline character. If `re.DOTALL` is passed as the second argument it includes the newline character in its search.

18. If num_re = re.compile(r'\d+'), what will num_re.sub('X', '12 drummers, 11 pipers, five rings, 3 hens') return?  
    **Answer:** `'X drummers, X pipers, five rings, X hens'`, because `sub()` replaces every match, not just the first.

19. What does passing re.VERBOSE as the second argument to re.compile() allow you to do?  
    **Answer:** It lets you add whitespace and comments to the string passed to `re.compile()`.
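
Several of these answers can be checked directly in the interactive shell. The sketch below covers questions 7, 14, 15, 17, and 18; the sample strings for everything except question 18 are assumptions made up for the check.

```python
import re

# Q7: findall() returns strings with zero or one group, tuples with more.
text = 'Cell: 415-555-9999 Work: 212-555-0000'
no_groups = re.findall(r'\d{3}-\d{3}-\d{4}', text)
one_group = re.findall(r'(\d{3})-\d{3}-\d{4}', text)
two_groups = re.findall(r'(\d{3})-(\d{3}-\d{4})', text)
print(no_groups)   # ['415-555-9999', '212-555-0000']
print(one_group)   # ['415', '212']
print(two_groups)  # [('415', '555-9999'), ('212', '555-0000')]

# Q14: .* is greedy, .*? is non-greedy.
greedy = re.search(r'<.*>', '<To serve man> for dinner.>').group()
lazy = re.search(r'<.*?>', '<To serve man> for dinner.>').group()
print(greedy)  # <To serve man> for dinner.>
print(lazy)    # <To serve man>

# Q15: a single character class holding both ranges.
digits_and_lower = re.findall(r'[0-9a-z]+', 'Room 42b, Floor 3')
print(digits_and_lower)  # ['oom', '42b', 'loor', '3']

# Q17: without re.DOTALL the dot stops at a newline.
two_lines = 'first line\nsecond line'
no_dotall = re.search(r'.*', two_lines).group()
with_dotall = re.compile(r'.*', re.DOTALL).search(two_lines).group()
print(repr(no_dotall))    # 'first line'
print(repr(with_dotall))  # 'first line\nsecond line'

# Q18: sub() replaces every match.
num_re = re.compile(r'\d+')
replaced = num_re.sub('X', '12 drummers, 11 pipers, five rings, 3 hens')
print(replaced)  # X drummers, X pipers, five rings, X hens
```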

## Practice Programs

### Strong Password Detection
Write a function that uses regular expressions to make sure the password string it is passed is strong. A strong password has several rules: it must be at least eight characters long, contain both uppercase and lowercase characters, and have at least one digit. Hint: It’s easier to test the string against multiple regex patterns than to try to come up with a single regex that can validate all the rules.

In [18]:
import re

def is_strong_password(password: str):
    # 1. Must be at least 8 characters long
    # 2. Must contain both uppercase and lowercase characters
    # 3. Must have at least 1 digit
    pass_len_re = re.compile(r'.{8,}')
    if not pass_len_re.search(password):
        return False
    
    lower_re = re.compile(r'[a-z]')
    if not lower_re.search(password):
        return False
    
    upper_re = re.compile(r'[A-Z]')
    if not upper_re.search(password):
        return False
    
    digit_re = re.compile(r'\d')
    if not digit_re.search(password):
        return False
    
    return True


print(is_strong_password('aW1'))

False
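
The hint about testing against several patterns can also be written as a loop over compiled regexes. The sketch below is one possible variant under that idea; the names `PASSWORD_RULES` and `is_strong` and the sample passwords are assumptions for illustration.

```python
import re

# One regex per rule; a strong password must satisfy all of them.
PASSWORD_RULES = (
    re.compile(r'.{8,}', re.DOTALL),  # at least eight characters
    re.compile(r'[a-z]'),             # a lowercase letter
    re.compile(r'[A-Z]'),             # an uppercase letter
    re.compile(r'\d'),                # a digit
)

def is_strong(password: str) -> bool:
    # search() returns a Match object (truthy) or None (falsy).
    return all(rule.search(password) for rule in PASSWORD_RULES)

for candidate in ('aW1', 'alllowercase1', 'Str0ngEnough'):
    print(candidate, is_strong(candidate))
# aW1 False
# alllowercase1 False
# Str0ngEnough True
```

Adding a rule then means adding one line to the tuple rather than another `if` block.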


### Regex Version of the strip() Method
Write a function that takes a string and does the same thing as the strip() string method. If no other arguments are passed other than the string to strip, then the function should remove whitespace characters from the beginning and end of the string. Otherwise, the function should remove the characters specified in the second argument to the function.
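
Before writing the regex version, it helps to pin down what the built-in `strip()` does, since that is the behavior to reproduce. The sample strings below are assumptions chosen to show that only the ends of the string are affected.

```python
# With no argument, strip() removes whitespace from both ends only.
print(repr('  Hello world \n'.strip()))  # 'Hello world'

# With an argument, it removes any of those characters from both ends
# and stops at the first character that is not in the set.
print(repr('xxHelloxx'.strip('x')))      # 'Hello'
print(repr('Hello'.strip('Ho')))         # 'ell'

# Characters in the middle are never touched, so this is unchanged:
# the string starts and ends with a space, which is not in 'el'.
print(repr(' Hello '.strip('el')))       # ' Hello '

reference = {
    'whitespace': '  Hello world \n'.strip(),
    'ends_only': 'xxHelloxx'.strip('x'),
    'char_set': 'Hello'.strip('Ho'),
    'unchanged': ' Hello '.strip('el'),
}
```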

In [26]:
def re_strip(string: str, chars: str = None) -> str:
    # Anchor to the start (^) or end ($) so interior characters are kept.
    if not chars:
        strip_re = re.compile(r'^\s+|\s+$')
    else:
        char_class = f'[{re.escape(chars)}]'
        strip_re = re.compile(f'^{char_class}+|{char_class}+$')
    return strip_re.sub('', string)


re_strip(" Hello ", "el")

' Hello '