In [2]:
import re

# **Review of Regex Symbols**

This chapter covered a lot of notation, so here’s a quick review of what
was learnt:

• The ? matches zero or one of the preceding group.

• The * matches zero or more of the preceding group.

• The + matches one or more of the preceding group.

• The {n} matches exactly n of the preceding group.

• The {n,} matches n or more of the preceding group.

• The {,m} matches 0 to m of the preceding group.

• The {n,m} matches at least n and at most m of the preceding group.

• {n,m}? or *? or +? performs a nongreedy match of the preceding group.

• ^spam means the string must begin with spam.

• spam$ means the string must end with spam.

• The . matches any character, except newline characters.

• \d, \w, and \s match a digit, word, or space character, respectively.

• \D, \W, and \S match anything except a digit, word, or space character,
  respectively.

• [abc] matches any character between the brackets (such as a, b, or c). 

• [^abc] matches any character that isn’t between the brackets.
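A few of the symbols above can be checked directly in the interactive shell. The following is a minimal sketch; the sample strings and variable names are illustrative and are not taken from the chapter.

```python
import re

# ? makes the preceding group optional (zero or one occurrence).
optional = re.search(r'Bat(wo)?man', 'Batwoman').group()

# {3,5} is greedy by default; adding ? makes the match nongreedy.
greedy = re.search(r'(Ha){3,5}', 'HaHaHaHaHa').group()
nongreedy = re.search(r'(Ha){3,5}?', 'HaHaHaHaHa').group()

# ^ and $ together require the whole string to match the pattern.
whole_number = re.search(r'^\d+$', '1234567890') is not None
not_whole = re.search(r'^\d+$', '12345xyz67890') is not None

# A negated character class matches any character that is not listed.
consonants = re.findall(r'[^aeiouAEIOU\s]', 'Robo Cop')

print(optional, greedy, nongreedy, whole_number, not_whole, consonants)
```

The greedy pattern consumes all five repetitions of Ha, while the nongreedy version stops as soon as the minimum of three is satisfied.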

# **Case-Insensitive Matching**

Normally, regular expressions match text with the exact casing that is specified.

For example, the following regexes match completely different strings:

In [None]:
regex1 = re.compile('RoboCop')
regex2 = re.compile('ROBOCOP')
regex3 = re.compile('robOcop')
regex4 = re.compile('RobocOp')

To make the regex case-insensitive, pass re.IGNORECASE or re.I as the second argument to re.compile().

Enter the following into the interactive shell:

In [None]:
robocop = re.compile(r'robocop', re.I)
robocop.search('Robocop is part man, part machine, all ROBOcop.').group()

'Robocop'

In [None]:
robocop.search('ROBOCOP protects the innocent.').group()

'ROBOCOP'

In [None]:
robocop.search('Al, why does your programming book talk about robocop so much?').group()

'robocop'
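Since search() returns only the first match, it can help to see every case variant that a case-insensitive regex finds. The sketch below reuses the first example sentence with findall(); the variable names are illustrative.

```python
import re

robocop = re.compile(r'robocop', re.IGNORECASE)
text = 'Robocop is part man, part machine, all ROBOcop.'

# search() stops at the first match; findall() returns every match,
# each with the casing it has in the text.
all_matches = robocop.findall(text)
print(all_matches)

# The inline flag (?i) at the start of the pattern has the same effect.
inline = re.compile(r'(?i)robocop')
print(inline.findall(text) == all_matches)
```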

# **Substituting Strings with the sub() Method**
Regular expressions can not only find text patterns but can also substitute
new text in place of those patterns. 

The sub() method for Regex objects is passed two arguments.

The first argument is the string that replaces any matches.

The second is the string to search, in which the substitutions are made.

The sub() method returns a string with the substitutions applied.

For example, enter the following into the interactive shell:

In [None]:
namesRegex = re.compile(r'Agent \w+')
namesRegex.sub('********', 'Agent Alice gave the secret documents to Agent Bob.')

'******** gave the secret documents to ********.'

Sometimes the matched text itself is needed as part of the substitution. For example, suppose the names of the secret agents should be **censored** by showing just the first letters of their names.

To do this, the regex Agent (\w)\w* can be used, with r'\1****' passed as the first argument to sub().

The \1 in that string will be replaced by whatever text was matched by group 1, that is, the (\w) group of the regular expression.

In [6]:
agentNamesRegex = re.compile(r'Agent (\w)\w*')
agentNamesRegex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')

'A**** told C**** that E**** knew B**** was a double agent.'
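Two further options of sub() may be useful here: the optional count argument, and passing a function instead of a replacement string. The sketch below assumes the same example sentence; mask_keep_length is a hypothetical helper written for this illustration.

```python
import re

agent_regex = re.compile(r'Agent (\w)\w*')
text = ('Agent Alice told Agent Carol that Agent Eve knew '
        'Agent Bob was a double agent.')

# count=1 limits the substitution to the first match only.
first_only = agent_regex.sub(r'\1****', text, count=1)

# A function can build the replacement from each match object.
# This hypothetical helper keeps the first letter and pads with
# asterisks so the masked name has the same length as the original.
def mask_keep_length(mo):
    name = mo.group(0)[len('Agent '):]
    return name[0] + '*' * (len(name) - 1)

same_length = agent_regex.sub(mask_keep_length, text)

print(first_only)
print(same_length)
```

The function form is worth considering whenever the replacement depends on more than a fixed template, such as the length of the matched name.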

# **Managing Complex Regexes**
Regular expressions are fine if the text pattern that needs to match is simple.

But matching complicated text patterns might require long, convoluted regular
expressions. 

This can be mitigated  by telling the re.compile() function to ignore whitespace and comments inside the regular expression string.

This “verbose mode” can be enabled by passing the variable re.VERBOSE as the second argument to re.compile().

Now instead of a hard-to-read regular expression like this:

In [7]:
phoneRegex = re.compile(r'((\d{2}|\(\d{2}\))?(\s|-|\.)?\d{5}(\s|-|\.)\d{5}(\s*(ext|x|ext\.)\s*\d{2,5})?)')

The regular expression can be spread  over multiple lines with comments like this:

In [None]:
phoneNumRegex = re.compile(r'''(
(\d{2}|\(\d{2}\))? # Country code
(\s|-|\.)? # separator
\d{5} # first 5 digits
(\s|-|\.) # separator
\d{5} # last 5 digits
(\s*(ext|x|ext\.)\s*\d{2,5})? # extension
)''', re.VERBOSE)

mo = phoneNumRegex.search('My number is 91-78555-42421 ext.58.')
print('Phone number found: ' + mo.group())

**Note**: 
Observe that the previous example uses the triple-quote syntax (''') to create a multiline string, so that the regular expression definition can be spread over many lines, making it much more legible.

The comment rules inside the regular expression string are the same as in regular Python code: the # symbol and everything after it to the end of the line are ignored.

Also, the extra spaces inside the multiline string for the regular expression are not considered part of the text pattern to be matched.

This lets you organize the regular expression so that it’s easier to read.
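One consequence of verbose mode deserves a quick check: because whitespace in the pattern is ignored, a literal space has to be escaped or placed in a character class. The sketch below uses a shortened, illustrative pattern (not the full phone regex) to compare a compact and a verbose form and to show the whitespace behavior.

```python
import re

compact = re.compile(r'\d{5}(\s|-|\.)\d{5}')
verbose = re.compile(r'''
    \d{5}       # first 5 digits
    (\s|-|\.)   # separator
    \d{5}       # last 5 digits
''', re.VERBOSE)

sample = 'Call 78555-42421 today'
same = compact.search(sample).group() == verbose.search(sample).group()

# In verbose mode an unescaped space is dropped from the pattern,
# so r'part man' behaves like r'partman' and finds nothing here.
ignored = re.compile(r'part man', re.VERBOSE).search('part man')
# Escaping the space, or putting it in a character class, keeps it.
escaped = re.compile(r'part\ man', re.VERBOSE).search('part man')
in_class = re.compile(r'part[ ]man', re.VERBOSE).search('part man')

print(same, ignored, escaped.group(), in_class.group())
```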

# **Combining re.IGNORECASE, re.DOTALL, and re.VERBOSE**

What if re.VERBOSE is needed to write comments in the regular expression, but re.IGNORECASE is also needed to ignore capitalization?

Unfortunately, the re.compile() function takes only a single value as its second argument.

This limitation can be overcome by combining the re.IGNORECASE, re.DOTALL, and re.VERBOSE variables using the pipe character (|), which in this context is known as the bitwise or operator.

So if a regular expression should be case-insensitive and should also let the dot character match newlines, the re.compile() call can be formed like this:

In [None]:
someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL)


Combining all three options in the second argument looks like this:

In [None]:
someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL | re.VERBOSE)
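To see that each flag contributes independently, the sketch below runs one illustrative pattern against a two-line string with no flags, with one flag, and with both. The pattern and text are assumptions made up for this example.

```python
import re

text = 'FOO first line\nsecond line'

# No flags: the lowercase pattern does not match the uppercase FOO.
no_flags = re.compile(r'foo.*second').search(text)

# re.IGNORECASE alone: foo matches FOO, but . cannot cross the newline.
ignorecase_only = re.compile(r'foo.*second', re.IGNORECASE).search(text)

# Both flags combined with |: the match spans the two lines.
both = re.compile(r'foo.*second', re.IGNORECASE | re.DOTALL).search(text)

print(no_flags, ignorecase_only, repr(both.group()))
```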


# **Project: Phone Number and Email Address Extractor**

Suppose there is a boring task of finding every phone number and email address in a long web page or document.

Scrolling through the page manually could mean searching for a long time.

But with a program that searches the text in the clipboard for phone numbers and email addresses, you could simply press **ctrl-A** to select all the text, press **ctrl-C** to copy it to the clipboard, and then run the program.

It could replace the text on the clipboard with just the phone numbers and email addresses it finds.

Whenever you’re tackling a new project, it can be tempting to dive right into writing code.

But more often than not, it’s best to take a step back and consider the bigger picture.

It is recommended to first draw up a high-level plan for what the program needs to do.

Don’t think about the actual code yet; it can be taken up later.

Right now, stick to broad strokes.

**For example**, the phone and email address extractor program will need to do the following:

• Get the text off the clipboard.

• Find all phone numbers and email addresses in the text.

• Paste them onto the clipboard.

Now you can start thinking about how this might work in code. 

The code will need to do the following:

• Use the pyperclip module to copy and paste strings.

• Create two regexes, one for matching phone numbers and the other for matching 
  email addresses.

• Find all matches, not just the first match, of both regexes. 

• Neatly format the matched strings into a single string to paste.

• Display some kind of message if no matches were found in the text.

# **Step 1: Create a Regex for Phone Numbers**
First, create a regular expression to search for phone numbers.

Create a new file, enter the following, and save it as phoneAndEmail.py:

In [None]:
#! python3
# phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.
import pyperclip, re
phoneRegex = re.compile(r'''(
(\d{2}|\(\d{2}\))? # country code
(\s|-|\.)? # separator
(\d{5}) # first 5 digits
(\s|-|\.) # separator
(\d{5}) # last 5 digits
(\s*(ext|x|ext\.)\s*(\d{2,5}))? # extension
)''', re.VERBOSE)
# TODO: Create email regex.
# TODO: Find matches in clipboard text.
# TODO: Copy results to the clipboard.

The TODO comments are just a skeleton for the program. 

They’ll be replaced while writing the actual code.

The phone number begins with an optional country code, so the country code group is followed by a **question** mark.

Since the country code can be just two digits (that is, \d{2}) or two digits within parentheses (that is, \(\d{2}\)), a pipe joins those parts.

The regex comment # Country code can be added to this part of the multiline string as a reminder of what (\d{2}|\(\d{2}\))? is supposed to match.

The phone number separator character can be a space (\s), hyphen (-), or period (\.), so these parts should also be joined by pipes.

The next few parts of the regular expression are straightforward: five digits, followed by another separator, followed by five digits. 

The last part is an optional extension made up of any number of spaces followed by ext, x, or ext., followed by two to five digits.
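Before moving on, the phone regex can be tried against a few strings. The sketch below assumes the layout described above (optional two-digit country code, five digits, separator, five digits, optional extension) with the period in ext. escaped; the sample numbers are made up for illustration.

```python
import re

phone_regex = re.compile(r'''(
    (\d{2}|\(\d{2}\))?               # country code
    (\s|-|\.)?                       # separator
    (\d{5})                          # first 5 digits
    (\s|-|\.)                        # separator
    (\d{5})                          # last 5 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?  # extension
)''', re.VERBOSE)

samples = [
    '91-78555-42421',
    '(91) 78555 42421 ext 123',
    '78555.42421',
    '555-1234',          # too short: should not match
]

results = {}
for sample in samples:
    mo = phone_regex.search(sample)
    results[sample] = mo.group() if mo else None
    print(sample, '->', results[sample])
```

With search(), the groups are numbered from 1, so the country code is group 2 and the extension digits are group 9; findall() returns the same groups as a tuple indexed from 0.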

# **Step 2: Create a Regex for Email Addresses**
A regular expression that can match email addresses will also be needed.
Make the program look like the following:

In [None]:
#! python3
# phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.
import pyperclip, re
phoneRegex = re.compile(r'''(
--snip--
# Create email regex.
emailRegex = re.compile(r'''(
[a-zA-Z0-9._%+-]+ # username
@ # @ symbol
[a-zA-Z0-9.-]+ # domain name
(\.[a-zA-Z]{2,4}) # dot-something
)''', re.VERBOSE)
# TODO: Find matches in clipboard text.
# TODO: Copy results to the clipboard.

The username part of the email address is one or more characters that can be any of the following: lowercase and uppercase letters, numbers, a dot, an underscore, a percent sign, a plus sign, or a hyphen.

All of these can be put into a character class: [a-zA-Z0-9._%+-].

The domain and username are separated by an @ symbol.

The domain name has a slightly less permissive character class with only letters, numbers, periods, and hyphens: [a-zA-Z0-9.-].

Last comes the “dot-com” part (technically known as the top-level domain), which can really be dot-anything.

In this regex it is between two and four letters.

The format for email addresses has a lot of weird rules. 

This regular expression won’t match every possible valid email address, but it’ll match almost any typical email address that will be encountered.
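The email regex can be exercised the same way. The sketch below uses made-up sample addresses (only info@nostarch.com appears elsewhere in this chapter) and also shows one limitation of the two-to-four-letter rule for the top-level domain.

```python
import re

email_regex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+    # username
    @                    # @ symbol
    [a-zA-Z0-9.-]+       # domain name
    (\.[a-zA-Z]{2,4})    # dot-something
)''', re.VERBOSE)

text = ('Write to info@nostarch.com or first.last+tag@mail.example.org, '
        'not to user@localhost.')

# Index 0 of each tuple is the outer group, i.e. the whole address.
found = [groups[0] for groups in email_regex.findall(text)]
print(found)

# Limitation: a top-level domain longer than four letters is cut short.
truncated = email_regex.search('curator@example.museum').group()
print(truncated)
```

The address user@localhost is skipped because it has no dot-something part, and the last example shows why the regex is described as matching typical addresses rather than every valid one.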

# **Step 3: Find All Matches in the Clipboard Text**
Now that you have specified the regular expressions for phone numbers and email addresses, you can let Python’s re module do the hard work of finding all the matches on the clipboard. 

The pyperclip.paste() function will get a string value of the text on the clipboard, and the findall() regex method will return a list of tuples.

Make the program look like the following:

In [None]:
#! python3
# phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.
import pyperclip, re
phoneRegex = re.compile(r'''(
--snip--
# Find matches in clipboard text.
text = str(pyperclip.paste())
matches = []
for groups in phoneRegex.findall(text):
    phoneNum = '-'.join([groups[1], groups[3], groups[5]])
    if groups[8] != '':
        phoneNum += ' x' + groups[8]
    matches.append(phoneNum)
for groups in emailRegex.findall(text):
    matches.append(groups[0])
# TODO: Copy results to the clipboard.

There is one tuple for each match, and each tuple contains strings for each group in the regular expression.

Because the whole pattern is wrapped in an outer pair of parentheses, the string at index 0 of each tuple is the entire match, so that is the one you are interested in.

The matches are stored in a list variable named matches.

It starts off as an empty list, and a couple of for loops fill it.

For the email addresses, you append index 0 of each tuple.

For the matched phone numbers, you don’t want to just append index 0.

While the program detects phone numbers in several formats, you want the phone number appended to be in a single, standard format.

The phoneNum variable contains a string built from groups 1, 3, 5, and 8 of the matched text.

(These groups are the country code, first five digits, last five digits, and extension.)
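The shape of the value returned by findall() depends on whether the pattern contains groups, which is why the loops index into tuples. The short sketch below uses simplified, illustrative patterns rather than the full phone regex.

```python
import re

text = 'A: 78555-42421 B: 98765-43210'

# No groups in the pattern: findall() returns a list of strings.
no_groups = re.findall(r'\d{5}-\d{5}', text)

# Groups in the pattern: findall() returns a list of tuples,
# one string per group, in the order the groups open.
with_groups = re.findall(r'((\d{5})-(\d{5}))', text)

# An optional group that did not take part in the match yields ''.
optional = re.findall(r'(\d{5})(x\d+)?', '78555 42421x9')

print(no_groups)
print(with_groups)
print(optional)
```

The empty string for an unmatched optional group is what makes a test such as groups[8] != '' a workable way to detect a missing extension.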

# **Step 4: Join the Matches into a String for the Clipboard**
Now that the email addresses and phone numbers are available as a list of strings in matches, they need to be put on the clipboard.

The pyperclip.copy() function takes only a single string value, not a list of strings, so the join() method is called on a separator string with matches as its argument.

To make it easier to see that the program is working, print any matches found to the terminal.

And if no phone numbers or email addresses were found, the program should tell the user this.

Make your program look like the following:

In [None]:
#! python3
# phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.
--snip--
for groups in emailRegex.findall(text):
    matches.append(groups[0])
# Copy results to the clipboard.
if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard:')
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found.')
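The joining and reporting logic can be isolated from the clipboard so it is easy to try on its own. In the sketch below, format_for_clipboard is a hypothetical helper invented for this illustration; it returns the text that would be copied together with the message that would be printed.

```python
def format_for_clipboard(matches):
    """Return (clipboard_text, message) for a list of matched strings."""
    if matches:  # an empty list is falsy, same as len(matches) > 0
        joined = '\n'.join(matches)
        return joined, 'Copied to clipboard:\n' + joined
    return '', 'No phone numbers or email addresses found.'

clip, message = format_for_clipboard(['91-78555-42421', 'info@nostarch.com'])
print(message)

empty_clip, empty_message = format_for_clipboard([])
print(empty_message)
```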

# **The Full Code of the Program is as follows:**

In [None]:
# -*- coding: utf-8 -*-
"""
Created on Sun Nov  1 19:54:37 2020

@author: Dr K Satyanarayan Reddy
"""
#! python3
# phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.

import pyperclip, re

phoneRegex = re.compile(r'''(
    (\d{2}|\(\d{2}\))? # Country code
    (\s|-|\.)? # Separator
    (\d{5}) # first 5 digits
    (\s|-|\.) # Separator
    (\d{5}) # last 5 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))? # Extension: ext, x, or ext. followed by 2 to 5 digits
    )''', re.VERBOSE)
# Create email regex.

emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+ # username
    @ # @ symbol
    [a-zA-Z0-9.-]+ # domain name
    (\.[a-zA-Z]{2,4}) # dot-something
    )''', re.VERBOSE)

# Find matches in clipboard text.

text = str(pyperclip.paste())

matches = []
for groups in phoneRegex.findall(text):
    phoneNum = '-'.join([groups[1], groups[3], groups[5]])
    if groups[8] != '':
        phoneNum += ' ext.' + groups[8]
    # Append every phone number, with or without an extension.
    matches.append(phoneNum)
for groups in emailRegex.findall(text):
    matches.append(groups[0])

# Copy results to the clipboard.

if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard:')
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found.')
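For testing without a clipboard, the same logic can be wrapped in a function that takes the text as an argument. This is a sketch under stated assumptions, not the program above: extract_contacts is a hypothetical name, the regexes escape the period in ext., and empty parts such as a missing country code are dropped before joining so the result has no leading hyphen.

```python
import re

PHONE_REGEX = re.compile(r'''(
    (\d{2}|\(\d{2}\))?               # country code
    (\s|-|\.)?                       # separator
    (\d{5})                          # first 5 digits
    (\s|-|\.)                        # separator
    (\d{5})                          # last 5 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?  # extension
)''', re.VERBOSE)

EMAIL_REGEX = re.compile(r'''(
    [a-zA-Z0-9._%+-]+    # username
    @                    # @ symbol
    [a-zA-Z0-9.-]+       # domain name
    (\.[a-zA-Z]{2,4})    # dot-something
)''', re.VERBOSE)

def extract_contacts(text):
    """Return phone numbers, then email addresses, found in text."""
    matches = []
    for groups in PHONE_REGEX.findall(text):
        # Assumption: skip empty parts (e.g. no country code).
        parts = [part for part in (groups[1], groups[3], groups[5]) if part]
        phone_num = '-'.join(parts)
        if groups[8] != '':
            phone_num += ' ext.' + groups[8]
        matches.append(phone_num)
    for groups in EMAIL_REGEX.findall(text):
        matches.append(groups[0])
    return matches

sample = 'Call 91-78555-42421 ext.58 or 78555 42421; mail info@nostarch.com.'
contacts = extract_contacts(sample)
print('\n'.join(contacts))
```

Passing the result of pyperclip.paste() to such a function, and the joined result to pyperclip.copy(), would recover the clipboard behavior of the full program.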



# **Running the Program**
For example, open a web browser to the No Starch Press contact page
at http://www.nostarch.com/contactus.htm, press ctrl-A to select all the text on
the page, and press ctrl-C to copy it to the clipboard.

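When you run this program, the output will look something like this. (The phone numbers shown are in the US 3-3-4 digit format; a phone regex built around a two-digit country code and two five-digit parts extracts only numbers in that layout.)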

In [None]:
Copied to clipboard:
800-420-7240
415-863-9900
415-863-9950
info@nostarch.com
media@nostarch.com
academic@nostarch.com
help@nostarch.com