## Finding Patterns of Text Without Regular Expressions

Say you want to find a phone number in a string. You know the pattern: three digits for the area code, a hyphen, three digits, a hyphen, and four digits. An example: 415-555-4242. Let's write a function that checks whether a piece of text follows that pattern:

In [2]:
def isPhoneNumber(text):
    if len(text) != 12:
        return False
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False
    if text[3] != '-':
        return False
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
    if text[7] != '-':
        return False
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
    return True

print('415-555-4242 is a phone number:')
print(isPhoneNumber('415-555-4242'))
print('Moshi moshi is a phone number:')
print(isPhoneNumber('Moshi moshi'))

415-555-4242 is a phone number:
True
Moshi moshi is a phone number:
False


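The function above only accepts a string that is exactly a phone number. To find phone numbers inside a longer text, we can slide over the text in chunks of 12 characters and test each chunk: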

In [3]:
message = 'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.'
for i in range(len(message)):
    chunk = message[i:i+12]
    if isPhoneNumber(chunk):
        print('Phone number found: ' + chunk)
print('Done')

Phone number found: 415-555-1011
Phone number found: 415-555-9999
Done


## Using Regular Expressions to Find Patterns

Regular expressions, called "regexes" for short, are descriptions of a pattern of text. To use them in Python, we need to import the re module.

In [5]:
import re

#### re.compile
We can create a pattern to search for in the text with the re.compile() function. For example, `re.compile('Atlas')` builds a pattern that matches the string Atlas, with an uppercase 'A'. The call returns a regex object, which we can store in a variable like any other object.
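
As a minimal sketch of the `'Atlas'` pattern (the variable names and sample sentences are made up for illustration), the following shows that the match is case-sensitive and that search() gives back None when nothing is found:

```python
import re

# Compile the literal pattern once and reuse the resulting regex object.
atlasRegex = re.compile('Atlas')

# search() returns a match object when the pattern occurs in the string...
found = atlasRegex.search('The Atlas rocket launched today.')
print(found.group())

# ...and None when it does not; lowercase 'atlas' is a different string.
missing = atlasRegex.search('An atlas is a book of maps.')
print(missing is None)
```

Because search() can return None, it is worth checking the result before calling group() on it.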

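To match a phone number we need a way to say "any digit". The shorthand for that is a backslash followed by d (\d). The pattern is written as a raw string (the r prefix) so that Python does not treat the backslashes as escape characters. For example: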

In [6]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('My number is 415-555-4242')
print('Phone number found: ' + mo.group())

Phone number found: 415-555-4242


#### Grouping
We can also group the parts of a pattern. Put parentheses around each part in the re.compile() call, then use the match object's group() method to retrieve the text matched by one group. Using the phone number example:

In [8]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242')
mo.group(1)

'415'

In [9]:
mo.group(2)

'555-4242'

In [10]:
mo.group(0)

'415-555-4242'

In [11]:
mo.group()

'415-555-4242'

In [12]:
mo.groups()

('415', '555-4242')

In [16]:
areaCode, mainNumber = mo.groups()
print(areaCode)
print(mainNumber)

415
555-4242


To summarize: the first group is the area code, three digits, written as `(\d\d\d)` in re.compile() and retrieved with group(1).
The second group is the main number, three digits, a hyphen, and four digits, written as `(\d\d\d-\d\d\d\d)` and retrieved with group(2).
The hyphen between the two groups is part of the pattern but belongs to neither group.
You can retrieve the whole match with either group(0) or group().
Calling groups() returns a tuple with the text of every group, in order. You can unpack that tuple into separate variables with the multiple-assignment trick: `areaCode, mainNumber = mo.groups()`.

#### Matching parentheses in text
Sometimes the text itself contains parentheses that you need to match. Put a backslash before each parenthesis to tell the regex that it is a literal character in the pattern and not the start or end of a group.

In [7]:
phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is (415) 555-4242')
print(mo.group(1))
print(mo.group(2))

(415)
555-4242


#### The Pipe Character
The pipe character '|' matches one of several alternatives. When more than one alternative occurs in the searched text, the one that occurs first in the text is returned as the match.

In [22]:
heroRegex = re.compile(r'Batman|Tina Frey')
mo1 = heroRegex.search('Batman and Tina Frey')

In [23]:
mo1.group()

'Batman'

In [24]:
mo2 = heroRegex.search('Tina Frey and Batman')

In [25]:
mo2.group()

'Tina Frey'

In [26]:
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')
mo.group()

'Batmobile'

In [27]:
mo.group(1)

'mobile'

#### Optional Match (zero or one match)
An optional match is marked with the question mark '?': the group that precedes it may appear zero times or once.

In [29]:
batRegex = re.compile(r'Bat(wo)?man')
mo1 = batRegex.search('The Adventures of Batman')
mo1.group()

'Batman'

In [30]:
mo2 = batRegex.search('The adventures of Batwoman')
mo2.group()

'Batwoman'

Applying this idea to phone numbers, you can write a regex that finds numbers with or without an area code.

In [31]:
phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
mo1 = phoneRegex.search('My number is 415-555-4242')
print(mo1.group())
mo2 = phoneRegex.search('My number is 555-4242')
print(mo2.group())

415-555-4242
555-4242


#### Multiple Matches (zero or more)
The question mark ('?') matches zero or one occurrence.
The asterisk ('*'), on the other hand, matches zero or more occurrences, with no upper limit.

In [33]:
batRegex = re.compile(r'Bat(wo)*man')
mo1 = batRegex.search('The Adventures of Batman')
print(mo1.group())
mo2 = batRegex.search('The Adventures of Batwoman')
print(mo2.group())
mo3 = batRegex.search('The Adventures of Batwowowoman')
print(mo3.group())

Batman
Batwoman
Batwowowoman


#### At Least One Match (one or more matches)
If you need at least one occurrence, use the plus sign '+', which matches one or more.

In [35]:
batRegex = re.compile(r'Bat(wo)+man')
mo1 = batRegex.search('The Adventures of Batwowowoman')
print(mo1.group())
mo2 = batRegex.search('The Adventures of Batwoman')
print(mo2.group())
mo3 = batRegex.search('The Adventures of Batman')
print(mo3 is None)

Batwowowoman
Batwoman
True


#### Matching Specific Repetitions with Curly Brackets
You can match specific repetitions with curly brackets, for example:

In [8]:
haRegex = re.compile(r'(Ha){3}')
mo1 = haRegex.search('HaHaHa')
print(mo1.group())
mo2 = haRegex.search('Hahaha')
print(mo2 is None)

HaHaHa
True


Also, you can put a range on the repetition.

In [39]:
haRegex = re.compile(r'((Ha){3,5})')
mo = haRegex.search('HaHaHaHaHa')
mo.group()

'HaHaHaHaHa'

The curly bracket forms are:

1. {n} - repeat exactly n times.
2. {n,m} - repeat from n to m times.
3. {,m} - repeat from 0 to m times.
4. {n,} - repeat n or more times.
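
As a rough illustration of the four forms above, the sketch below uses re.fullmatch(), which succeeds only when the whole string fits the pattern. The helper name `matches` and the test strings are made up for this example.

```python
import re

def matches(pattern, text):
    """Return True if the entire text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

exactlyThree = matches(r'(Ha){3}', 'HaHaHa')        # three repetitions
tooFew = matches(r'(Ha){3}', 'HaHa')                # only two repetitions
inRange = matches(r'(Ha){2,4}', 'HaHaHa')           # three is within 2 to 4
aboveRange = matches(r'(Ha){2,4}', 'HaHaHaHaHa')    # five is above 4
zeroAllowed = matches(r'(Ha){,2}', '')              # zero repetitions is fine
openEnded = matches(r'(Ha){2,}', 'HaHaHaHaHaHa')    # six is at least 2

print(exactlyThree, tooFew, inRange, aboveRange, zeroAllowed, openEnded)
```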

#### Greedy and Nongreedy

Python's regular expressions are greedy by default: when a pattern could match strings of different lengths, they match the longest one possible. In the last example, three or four 'Ha' would also have satisfied the range, but the regex returned all five. That is what greedy means.
To make the match nongreedy, so that it returns the shortest possible string, add a question mark after the closing curly bracket.

In [40]:
haRegex = re.compile(r'((Ha){3,5}?)')
mo = haRegex.search('HaHaHaHaHa')
mo.group()

'HaHaHa'
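
The same trailing question mark also works after the plus and the asterisk. A small sketch, with a made-up string of digits, compares the greedy and nongreedy forms:

```python
import re

digits = '123456'

# Greedy: take as many digits as possible.
greedy = re.search(r'\d+', digits).group()

# Nongreedy: stop as soon as the pattern is satisfied.
lazyPlus = re.search(r'\d+?', digits).group()
lazyRange = re.search(r'\d{2,4}?', digits).group()

print(greedy)
print(lazyPlus)
print(lazyRange)
```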

#### The findall() method
The search() method returns only the first match, but findall() returns every match in the text.

In [6]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

['415-555-9999', '212-555-0000']

If the regex has no groups, findall() returns a list of strings, as above. If the regex has groups, findall() returns a list of tuples, with one string per group.

In [7]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

[('415', '555', '9999'), ('212', '555', '0000')]

### Shorthand Codes for Common Character Classes

\d - Any numeric digit from 0 to 9

\D - Any character that is NOT a numeric digit from 0 to 9

\w - Any letter, numeric digit, or the underscore character. (Think of it as matching "word" characters.)

\W - Any character that is NOT a letter, numeric digit, or the underscore character.

\s - Any space, tab, or newline character. (Think of it as matching "spaces" characters.)

\S - Any character that is NOT a space, tab, or newline.

An example:

In [8]:
xmasRegex = re.compile(r'\d+\s\w+')
xmasRegex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')

['12 drummers',
 '11 pipers',
 '10 lords',
 '9 ladies',
 '8 maids',
 '7 swans',
 '6 geese',
 '5 rings',
 '4 birds',
 '3 hens',
 '2 doves',
 '1 partridge']
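
The uppercase shorthand codes work the same way but match the opposite set of characters. A short sketch on a made-up sample string:

```python
import re

text = 'Order #42: 3 red_apples'

# \D+ : runs of characters that are not digits
nonDigits = re.findall(r'\D+', text)

# \W+ : runs of characters that are not letters, digits, or underscores
nonWord = re.findall(r'\W+', text)

# \S+ : runs of characters that are not whitespace
nonSpace = re.findall(r'\S+', text)

print(nonDigits)
print(nonWord)
print(nonSpace)
```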

### Making your own character classes

You can create your own character class by listing characters inside square brackets. For example:

In [9]:
vowelRegex = re.compile(r'[aeiouAEIOU]')
vowelRegex.findall('Robocop eats baby food. BABY FOOD.')

['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']

You can also use a hyphen to write a range of characters, like [A-Z] or [0-5]. Inside square brackets most regex symbols lose their special meaning, so you do not need a backslash before characters such as ., *, ? or (). The backslash itself is an exception: to match a literal backslash inside a character class you still have to escape it, as in `[\\]`.
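
A small sketch of ranges and of punctuation placed inside square brackets, using made-up sample strings:

```python
import re

# Ranges: uppercase letters, and the digits 0 through 5.
upper = re.findall(r'[A-Z]', 'Call HQ at 7')
lowDigits = re.findall(r'[0-5]', 'Room 4096')
print(upper)
print(lowDigits)

# Inside the brackets, . ? ( and ) are ordinary characters.
punct = re.findall(r'[.?()]', 'Really? (Yes.)')
print(punct)
```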

#### Negative Character Class

You can also make a negative character class, which matches any character that is not listed inside the square brackets, by adding a caret (^) right after the opening square bracket. For example:

In [10]:
vowelRegex = re.compile(r'[^aeiouAEIOU]')
vowelRegex.findall('Robocop eats baby food. BABY FOOD.')

['R',
 'b',
 'c',
 'p',
 ' ',
 't',
 's',
 ' ',
 'b',
 'b',
 'y',
 ' ',
 'f',
 'd',
 '.',
 ' ',
 'B',
 'B',
 'Y',
 ' ',
 'F',
 'D',
 '.']

#### Starting with and Ending with
To check whether a string starts with a pattern, put a caret (^) at the beginning of the regex. Likewise, to check whether a string ends with a pattern, put a dollar sign ($) at the end of the regex. Example:

In [15]:
beginsWithHello = re.compile(r'^Hello')
print(beginsWithHello.search('Hello World'))
print(beginsWithHello.search('He said hello.') is None)
endsWithNumber = re.compile(r'\d$')
print(endsWithNumber.search('Your number is 42'))
print(endsWithNumber.search('Your number is 42.') is None)

<_sre.SRE_Match object; span=(0, 5), match='Hello'>
True
<_sre.SRE_Match object; span=(16, 17), match='2'>
True
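
The caret and the dollar sign can also be combined so that the pattern has to cover the entire string. A sketch with a hypothetical `wholeStringIsNum` regex:

```python
import re

# ^ and $ together: the string must consist of digits from start to end.
wholeStringIsNum = re.compile(r'^\d+$')

allDigits = wholeStringIsNum.search('1234567890')
lettersInside = wholeStringIsNum.search('12345xyz67890')
spaceInside = wholeStringIsNum.search('12 34567890')

print(allDigits.group())
print(lettersInside is None)
print(spaceInside is None)
```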


#### The Wildcard Character

The dot character (.) in a regular expression is called a wildcard. It matches any single character except a newline. For example:

In [17]:
atRegex = re.compile(r'.at')
atRegex.findall('The cat in the hat sat on the flat mat.')

['cat', 'hat', 'sat', 'lat', 'mat']

Note that the dot matches just one character, so the regex returns only the single character before 'at', not the whole word. That is why "flat" appears in the results as 'lat'.

#### Finding (Almost) Everything
To match everything, combine the dot with the star: dot-star (.*) matches zero or more of any character except the newline. Example:

In [3]:
nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
mo = nameRegex.search('First Name: Al Last Name: Sweigart')
print(mo.group(1))
print(mo.group(2))

Al
Sweigart


The dot-star is also greedy. To make it nongreedy, put a question mark after it (.*?). Example:

In [4]:
nongreedyRegex = re.compile(r'<.*?>')
mo = nongreedyRegex.search('<To serve man> for dinner.>')
print(mo.group())
greedyRegex = re.compile(r'<.*>')
mo = greedyRegex.search('<To serve man> for dinner.>')
print(mo.group())

<To serve man>
<To serve man> for dinner.>


### The Second Argument
#### .DOTALL

To make the dot match every character, including the newline, pass re.DOTALL as the second argument to re.compile(). Example:

In [6]:
noNewlineRegex = re.compile('.*')
print(noNewlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group())
newlineRegex = re.compile('.*', re.DOTALL)
print(newlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group())

Serve the public trust.
Serve the public trust.
Protect the innocent.
Uphold the law.


#### .IGNORECASE
Sometimes you want to match letters regardless of whether they are uppercase or lowercase. For that, pass re.IGNORECASE, or simply re.I, as the second argument. Example:

In [8]:
robocop = re.compile(r'robocop', re.I)
print(robocop.search('Robocop is part man, part machine, all cop.').group())
print(robocop.search('ROBOCOP protects the innocent.').group())
print(robocop.search('Al, why does your programming book talk about robocop so much?').group())

Robocop
ROBOCOP
robocop


#### The sub() method

If you want to find strings in the text and substitute them with something else, use the sub() method. It takes two arguments: the first is the replacement string, and the second is the text to search. It returns a new string with the substitutions applied. Example:

In [11]:
namesRegex = re.compile(r'Agent \w+')
namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob')

'CENSORED gave the secret documents to CENSORED'

In this example every match of 'Agent' followed by a name is replaced by 'CENSORED'.

Sometimes you want to keep part of the matched text in the replacement, for example to show only the first letter of each name. To do that, put parentheses around the part to keep and refer to it in the replacement string as \1, \2, \3, and so on, meaning "the text of group 1, 2, 3". See this example:

In [12]:
agentNamesRegex = re.compile(r'Agent (\w)\w*')
agentNamesRegex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')

'A**** told C**** that E**** knew B**** was a double agent.'
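
When the regex has several groups, the replacement string can use them in any order. The sketch below, with a made-up `dateRegex` and sample sentence, rearranges year-month-day dates into day/month/year:

```python
import re

# Group 1 = year, group 2 = month, group 3 = day.
dateRegex = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

# \3/\2/\1 writes the groups back in reverse order.
reordered = dateRegex.sub(r'\3/\2/\1', 'Shipped 2024-01-31, arrived 2024-02-05.')
print(reordered)
```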

#### .VERBOSE

A complicated regex is easier to read when it is spread over several lines with a comment on each part. To allow whitespace and comments inside the pattern, pass re.VERBOSE as the second argument. Using the phone number example:

In [16]:
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?            # area code
    (\s|-|\.)?                    # separator
    \d{3}                         # first 3 digits
    (\s|-|\.)                     # separator
    \d{4}                         # last 4 digits
    (\s*(ext|x|ext.)\s*\d{2,5})?  # extension
    )''', re.VERBOSE)
print(phoneRegex.findall('Cell: 415-555-9999 Work: 212-555-0000'))

[('415-555-9999', '415', '-', '-', '', ''), ('212-555-0000', '212', '-', '-', '', '')]


Now the regex explains, part by part, what it is trying to match.

#### Facing a problem: Using all three second arguments at the same time!

The options we used above (DOTALL, IGNORECASE and VERBOSE) are all passed as the second argument of re.compile(), and that function accepts only one value there. What can we do about that?

You can combine them with the pipe character (|), which in this context is Python's bitwise OR operator.

It looks like this:

```python
re.compile(r'something', re.I | re.DOTALL | re.VERBOSE)
```

Now all three options apply to the same regex.
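
As a hedged sketch of the combined options at work (the `blockRegex` name, the pattern and the sample text are invented for this example), the regex below needs all three: comments inside the pattern, case-insensitive keywords, and a dot that crosses line breaks.

```python
import re

blockRegex = re.compile(r'''
    begin    # opening keyword, in any letter case
    .*       # everything in between, newlines included
    end      # closing keyword
    ''', re.I | re.DOTALL | re.VERBOSE)

text = 'BEGIN first line\nsecond line END'
mo = blockRegex.search(text)

# The match covers the whole two-line text.
print(mo.group() == text)
```

Removing any one of the three options from the call makes this particular search fail or the pattern unreadable, which is a quick way to see what each one contributes.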

## Project: Phone Number and Email Address Extractor

Your program needs to do the following:
1. Get the text off the clipboard
2. Find all the phone numbers and email addresses in the text
3. Paste them onto the clipboard

In terms of code, the logic may look like this:
1. Use the pyperclip module to copy and paste strings.
2. Create two regexes, one for phone numbers and another for email addresses.
3. Find all matches, not just the first match, of both regexes.
4. Neatly format the matched strings into a single string to paste.
5. Display some kind of message when there is no match.

We'll do that in parts.

#### 1st - Create the Regex for Phone Numbers

First you need to import the modules. (You can also put `#! python3` at the top of the file so that it runs as a program, followed by a comment summarizing what the program does.)

In [1]:
import re, pyperclip

Now create the regex for the phone number, using the verbose form so that each part can carry a comment.

In [2]:
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                # area code
    (\s|-|\.)?                        # separator
    (\d{3})                           # first 3 digits
    (\s|-|\.)                         # separator
    (\d{4})                           # last 4 digits
    (\s*(ext|x|ext.)\s*(\d{2,5}))?    # extension
    )''', re.VERBOSE)

Note that the area code is optional and has two alternatives: three digits alone, `\d{3}`, or three digits inside parentheses, `\(\d{3}\)`.

The separators also have alternatives: a whitespace character `\s`, a hyphen `-`, or a period `\.`.

The last optional part is the extension: two to five digits that follow the string ext, x, or ext. (with a period).

Note that the alternatives are separated by the pipe character.
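
To see which part of a number lands in which group, the sketch below runs the same phone regex over a made-up sample string and prints the pieces. The sample text and the `found` variable are assumptions for illustration only.

```python
import re

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                # area code
    (\s|-|\.)?                        # separator
    (\d{3})                           # first 3 digits
    (\s|-|\.)                         # separator
    (\d{4})                           # last 4 digits
    (\s*(ext|x|ext.)\s*(\d{2,5}))?    # extension
    )''', re.VERBOSE)

sample = 'Office: (415) 555-1011, home: 415.555.9999 ext 1234.'
found = phoneRegex.findall(sample)

for groups in found:
    # groups[0] is the whole match; 1, 3, 5 and 8 are the useful pieces.
    print(groups[0], '->', groups[1], groups[3], groups[5], groups[8])
```

Each tuple has nine items because the regex has nine pairs of parentheses, counting the outer pair that wraps the whole pattern.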

That covers half of the second step of the plan:
> 2. Create two regexes, one for phone numbers and another for email addresses.

Let's finish it.

#### 2nd - Create the Regex for Email Addresses

In [3]:
#create email regex

emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+      # username
    @                      # @ symbol
    [a-zA-Z0-9.-]+         # domain name
    (\.[a-zA-Z]{2,4})      # dot-something
)''', re.VERBOSE)

The username and the domain name are each a character class followed by a plus (+), so they match one or more of the allowed characters. Whitespace is not among them, so a match never contains a space. The @ is a literal character that must appear exactly once, so it needs no plus. The last part, the '.com' part, must begin with a dot and allows only 2 to 4 letters after it.

This regex will not cover every valid email address, because the formal rules are complicated, but it will match almost any typical address.
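
A quick sketch of the email regex on made-up addresses shows both a normal match and one of its limits: a top-level domain longer than four letters is cut short.

```python
import re

emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+      # username
    @                      # @ symbol
    [a-zA-Z0-9.-]+         # domain name
    (\.[a-zA-Z]{2,4})      # dot-something
)''', re.VERBOSE)

sample = 'Write to al.s@example.com or info@example.museum today'

# groups[0] is the whole address; groups[1] is only the dot-something part.
found = [groups[0] for groups in emailRegex.findall(sample)]
print(found)
```

The second address comes back as `info@example.muse`, because `{2,4}` stops after four letters.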

#### 3rd - Find All Matches in the Clipboard Text

Now let's use the findall() method to find all the matches.

In [4]:
#find matches on the clipboard text

text = str(pyperclip.paste())
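matches = []  # collected phone numbers and email addresses
for groups in phoneRegex.findall(text):
    # groups[1] = area code, groups[3] = first 3 digits, groups[5] = last 4 digits
    phoneNum = '-'.join([groups[1], groups[3], groups[5]])
    if groups[8] != '':  # groups[8] = extension digits, if any
        phoneNum += ' x' + groups[8]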
    matches.append(phoneNum)
for groups in emailRegex.findall(text):
    matches.append(groups[0])

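Recall that for the email regex, groups[0] holds the text matched by the entire regular expression. For the phone regex, groups[1], groups[3], groups[5] and groups[8] hold the area code, the first three digits, the last four digits and the extension.

The matches are now stored in the list, so only the last part remains.

#### 4th - Join the Matches into a String for the Clipboard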

In [5]:
#copy the results for the clipboard

if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to the clipboard:')
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found.')

No phone numbers or email addresses found.
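
To try the logic without touching the clipboard, the same steps can be wrapped in a function that takes the text as an argument. This is only a sketch: the function name `extractContacts` and the sample sentence are made up, and the two regexes are copied from the cells above.

```python
import re

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                # area code
    (\s|-|\.)?                        # separator
    (\d{3})                           # first 3 digits
    (\s|-|\.)                         # separator
    (\d{4})                           # last 4 digits
    (\s*(ext|x|ext.)\s*(\d{2,5}))?    # extension
    )''', re.VERBOSE)

emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+      # username
    @                      # @ symbol
    [a-zA-Z0-9.-]+         # domain name
    (\.[a-zA-Z]{2,4})      # dot-something
)''', re.VERBOSE)

def extractContacts(text):
    """Return the phone numbers, then the email addresses, found in text."""
    matches = []
    for groups in phoneRegex.findall(text):
        # Rebuild every number in one uniform format.
        phoneNum = '-'.join([groups[1], groups[3], groups[5]])
        if groups[8] != '':
            phoneNum += ' x' + groups[8]
        matches.append(phoneNum)
    for groups in emailRegex.findall(text):
        matches.append(groups[0])
    return matches

sample = 'Call 415-555-1011 ext 22 or email help@example.com.'
contacts = extractContacts(sample)
print('\n'.join(contacts))
```

Passing `pyperclip.paste()` as the argument and sending the joined result to `pyperclip.copy()` gives the clipboard version back.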


You can find this program on this branch.

# Finish