# 1. Regular Expressions (Regexes)

## Introduction

In the following lessons we will learn how to create basic **Regular Expressions** in Python. Regular Expressions, also known as Regexes, are used to find patterns in text. In general, regexes work by first specifying the rules for the set of possible strings that you want to match and then asking questions such as "Is this pattern found at the beginning of this string?" or "Is there a match for this pattern anywhere in this string?". We will learn, for example, how to write regular expressions to find phone numbers, names, and email addresses.

By the end of this lesson you should be able to read and write basic regular expressions in Python and know how to apply them to get useful financial information from 10-Ks.

## Raw Strings

Before we dive in and start creating our first regular expression, let's take a quick look at **Raw Strings**, since we will be using them to create our regexes.

In Python, string literals are specified using either single quotes (`'`) or double quotes (`"`), and the backslash (`\`) character is used to introduce escape sequences for characters that have a special meaning, such as a newline (`\n`) or tab (`\t`). Let's see a simple example:

In [14]:
print('Hello\n\tWorld')

Hello
	World


We can see that `\n` has been turned into a new line and `\t` into a tab. This conversion happens when Python reads the string literal, before `print()` is even called.

In some cases, however, you may want Python to interpret the string literally. This means that you don’t want characters preceded by a backslash (`\`) to be interpreted as special characters. In these cases, you can prefix the string literal with the letter `r`. Such strings are known as **Raw Strings** and treat backslashes (`\`) as literal characters. To see how this works, let's print the same string literal we had before but now as a raw string:

In [15]:
print(r'Hello\n\tWorld')

Hello\n\tWorld


We can see that by adding an `r` before the first quote of the string literal, both `\n` and `\t` are no longer treated as special characters. It is important to note that the `r` doesn't change the type of the string: the result is still an ordinary `str`. It only changes how the literal is interpreted. So, without the `r`, backslashes are used to escape characters, and with the `r`, backslashes are treated as literal characters.

We will be using raw strings to create our regular expressions because regular expressions themselves also use the backslash character (`\`) to indicate their own special characters. Therefore, by using raw strings, we avoid the problem of Python interpreting the backslashes in our regexes before the `re` module ever sees them.
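
As a quick check of the point above, the following sketch compares a regular string literal with its raw counterpart. The variable names are illustrative only.

```python
# A regular literal: \n is a single newline character
normal = '\n'

# A raw literal: a backslash followed by the letter n (two characters)
raw = r'\n'

print(len(normal))  # 1
print(len(raw))     # 2

# Without the r prefix, each backslash must be doubled to get the same string
same = ('\\d' == r'\d')
print(same)  # True
```

The last comparison shows why raw strings are convenient for regexes: the raw form needs only one backslash where a regular literal needs two.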

# 2. Finding Words Using Regexes

In this notebook we will learn how to find letters and words in a string using regular expressions. Throughout these lessons, we will use the `re` module from Python's standard library to work with regular expressions. The `re` module not only contains functions that allow us to check if a given regular expression matches a particular string, but also contains functions that allow us to modify strings in various ways. 

Let’s begin by using a regular expression to find all the locations of a single letter in a given string. To do this, we will use the `re.compile()` function from the `re` module. The `re.compile(pattern)` function converts a regular expression `pattern` into a regular expression object. This allows us to save our regular expressions into objects that can be used later to perform pattern matching using various methods, such as `.match()`, `.search()`, `.findall()`, and `.finditer()`. Let’s see how this works.

In the code below, we will find all the locations of the letter `a` in a string named `sample_text`. In this case, our regular expression pattern will just be `'a'` and we will pass it to the `re.compile()` function as a raw string. We will save the regular expression object returned by the `re.compile()` function in a variable called `regex`. We will then use the `.finditer()` method to search our `sample_text` for the regular expression contained in the `regex` object. The `.finditer()` method returns an iterator over all the non-overlapping matches of our regular expression pattern in the string. It scans the string from left to right and returns the matches in the order found. Since the `.finditer()` method returns an iterator, we can loop through it to print all the matches, as shown below:

In [16]:
# Import re module
import re

# Sample text
sample_text = 'Alice and Walter are walking to the store.'

# Create a regular expression object with the regular expression 'a'
regex = re.compile(r'a')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(6, 7), match='a'>
<re.Match object; span=(11, 12), match='a'>
<re.Match object; span=(17, 18), match='a'>
<re.Match object; span=(22, 23), match='a'>


We can see that each match corresponds to a Match object with a given `span` and corresponding `match`. The `span=(start, end)` is a tuple that indicates the `start` and `end` indices of the given `match` in the string `sample_text`. The `end` index is exclusive, just as in Python slicing. For example, if we look at the `span` of the first match, we can see that the first `a` starts at index `6` and ends just before index `7`. Therefore, if we print the slice `sample_text[6:7]` we will see that it corresponds to the letter `a`:

In [17]:
# Print the sample_text string from index 6 through 7
print(sample_text[6:7])

a


Notice, however, that even though the first letter in our `sample_text` is an uppercase `A`, the `.finditer()` method didn't return it as a match. This is because regular expressions are case sensitive by default. Therefore, in order to match this uppercase `A` we need to use `'A'` as our regular expression, as shown below:

In [18]:
# Import re module
import re

# Sample text
sample_text = 'Alice and Walter are walking to the store.'

# Create a regular expression object with the regular expression 'A'
regex = re.compile(r'A')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(0, 1), match='A'>


Notice that now, the `.finditer()` method only returned one match, since there is only one uppercase `A` in our `sample_text`. Also, notice that the `span=(0,1)` tells us that the uppercase `A` is the first letter in the `sample_text` string. 

We should note that the `re` module allows us to perform **case-insensitive** searches by means of **Flags**. For example, we might want to search our string for the letter `a`, regardless of whether it is uppercase or lowercase. We will learn about flags in a later lesson.
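
As a brief preview, and only as a sketch since flags are covered properly in a later lesson, the standard `re.IGNORECASE` flag can be passed as a second argument to `re.compile()`. The variable `spans` is just an illustrative name.

```python
import re

sample_text = 'Alice and Walter are walking to the store.'

# re.IGNORECASE makes the pattern match both 'a' and 'A'
regex = re.compile(r'a', re.IGNORECASE)

# Collect the span of every match
spans = [match.span() for match in regex.finditer(sample_text)]
print(spans)  # [(0, 1), (6, 7), (11, 12), (17, 18), (22, 23)]
```

The first span, `(0, 1)`, is the uppercase `A` that the case-sensitive search for `'a'` skipped.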

Besides searching for a single letter, we can also search for groups of letters. This is done in exactly the same manner as with single letters. Let's search for the word `walking` in our `sample_text` string:

In [19]:
# Import re module
import re

# Sample text
sample_text = 'Alice and Walter are walking to the store.'

# Create a regular expression object with the regular expression 'walking'
regex = re.compile(r'walking')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

    print('\nMatch from the original text:', sample_text[match.span()[0]:match.span()[1]])

<re.Match object; span=(21, 28), match='walking'>

Match from the original text: walking


Notice that we only get one match, since there is only one instance of the word `walking` in our `sample_text`. Also, notice that in the above example we used the `match.span()` method to get the start and end indices of our match.
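
The Match object offers a few related accessors. The sketch below reuses the same `sample_text` and shows that `.start()` and `.end()` return the two halves of `.span()`, and that `.group()` returns the matched text directly, so slicing the original string is not strictly necessary. It uses `.search()`, which returns only the first match.

```python
import re

sample_text = 'Alice and Walter are walking to the store.'
regex = re.compile(r'walking')

# .search() returns the first match, or None if there is none
match = regex.search(sample_text)

start, end = match.span()
print(start, end)                   # 21 28
print(match.start(), match.end())   # 21 28
print(sample_text[start:end])       # walking
print(match.group())                # walking
```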

When using regular expressions to search for groups of letters, we should note that the order of the letters matters. For example, if we were to search for `ginwakl` in our `sample_text`, we wouldn't find any matches even though the same letters are contained in the word `walking`, as shown in the code below:

In [20]:
# Import re module
import re

# Sample text
sample_text = 'Alice and Walter are walking to the store.'

# Create a regular expression object with the regular expression 'ginwakl'
regex = re.compile(r'ginwakl')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

The cell above prints nothing: there are no matches because the `.finditer()` method is looking for those letters in that particular order in our `sample_text` string.
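
To make the "no matches" case concrete, here is a small sketch of how the methods mentioned earlier behave when the pattern is absent. It relies only on standard `re` behavior; the variable names are illustrative.

```python
import re

sample_text = 'Alice and Walter are walking to the store.'
regex = re.compile(r'ginwakl')

# .finditer() yields nothing, so the list is empty
found = list(regex.finditer(sample_text))
print(found)  # []

# .findall() returns an empty list of strings
found_strings = regex.findall(sample_text)
print(found_strings)  # []

# .search() and .match() return None when there is no match
print(regex.search(sample_text))  # None
print(regex.match(sample_text))   # None
```

Because `.search()` and `.match()` return `None`, code that calls methods such as `.span()` on their result should first check that a match was actually found.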

## TODO: Find Words

In the cell below, the `sample_text` string contains the name Walter Brown written in a mixture of uppercase and lowercase letters. Write a regular expression that matches the name `WaLtEr BroWN` and save the regular expression object in a variable called `regex`. Then use the `.finditer()` method to search the `sample_text` string for the given regular expression. Then, write a loop to print all the `matches` found by the `.finditer()` method. Finally, use the `match.span()` method to print the match from the `sample_text` string.

In [21]:
# Import re module
import re

# Sample text
sample_text = 'Alice and WaLtEr BroWN are talking with wAlTer Jackson.'

# Create a regular expression object with the regular expression 'WaLtEr BroWN'
regex = re.compile(r'WaLtEr BroWN')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

    # Using the span information from the match, print the match from the original string
    print('\nMatch from the original text:', sample_text[match.span()[0]:match.span()[1]])

<re.Match object; span=(10, 22), match='WaLtEr BroWN'>

Match from the original text: WaLtEr BroWN


## Matching a Period (`.`)

Now, let's use a regular expression to find the period (`.`) at the end of our `sample_text` string. Let's search for the period in the same manner as we did for single letters:

In [22]:
# import re module
import re

# Sample text
sample_text = 'Alice and Walter are walking to the store.'

# Create a regular expression object with the regular expression '.'
regex = re.compile(r'.')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(0, 1), match='A'>
<re.Match object; span=(1, 2), match='l'>
<re.Match object; span=(2, 3), match='i'>
<re.Match object; span=(3, 4), match='c'>
<re.Match object; span=(4, 5), match='e'>
<re.Match object; span=(5, 6), match=' '>
<re.Match object; span=(6, 7), match='a'>
<re.Match object; span=(7, 8), match='n'>
<re.Match object; span=(8, 9), match='d'>
<re.Match object; span=(9, 10), match=' '>
<re.Match object; span=(10, 11), match='W'>
<re.Match object; span=(11, 12), match='a'>
<re.Match object; span=(12, 13), match='l'>
<re.Match object; span=(13, 14), match='t'>
<re.Match object; span=(14, 15), match='e'>
<re.Match object; span=(15, 16), match='r'>
<re.Match object; span=(16, 17), match=' '>
<re.Match object; span=(17, 18), match='a'>
<re.Match object; span=(18, 19), match='r'>
<re.Match object; span=(19, 20), match='e'>
<re.Match object; span=(20, 21), match=' '>
<re.Match object; span=(21, 22), match='w'>
<re.Match object; span=(22, 23), match='a'>
...

We can clearly see that something has gone wrong: the `.finditer()` method has matched every single character in the `sample_text` string, including spaces, uppercase and lowercase letters, and the period at the end.

This is because, in regular expressions, the `.` is a special character known as a **Metacharacter**. Metacharacters are used to give special instructions and we will learn about them in the next lesson.
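
As a small illustration, assuming the default settings of the `re` module, the unescaped `.` matches any single character except a newline. The string below is made up for this sketch.

```python
import re

# Three characters: 'a', a newline, and 'b'
text = 'a\nb'

# The unescaped dot matches 'a' and 'b' but skips the newline
dot_matches = re.compile(r'.').findall(text)
print(dot_matches)  # ['a', 'b']
```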

# 3. Finding Metacharacters

Here’s a complete list of the metacharacters used in regular expressions:

```
. ^ $ * + ? { } [ ] \ | ( )
```

As we mentioned in the previous lesson, these metacharacters are used to give special instructions and can't be searched for directly. If we want to search for these metacharacters directly in strings we need to escape them first. Just like with Python string literals, we can use the backslash (`\`) to escape all the metacharacters. Let’s see an example.

Let's try to find the period (`.`) at the end of our `sample_text` again, but this time we will use a backslash (`\`) in our regular expression to remove the period's special meaning, as shown in the code below:

In [23]:
# Import re module
import re

# Sample text
sample_text = 'Alice and Walter are walking to the store.'

# Create a regular expression object with the regular expression '\.'
regex = re.compile(r'\.')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(41, 42), match='.'>


We can see that now, we have managed to find only the period (`.`) at the end of the `sample_text` string, as was intended. 

To search for any of the other metacharacters we can do exactly the same thing.
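
When a literal string contains several metacharacters, escaping each one by hand is error-prone. The standard library's `re.escape()` function, which is not otherwise used in this lesson, adds the backslashes for you. A minimal sketch, using a made-up string:

```python
import re

text = 'What is 2+2?'

# re.escape() backslash-escapes the metacharacters in a literal string
pattern = re.escape('2+2')
print(pattern)  # 2\+2

escaped_match = re.compile(pattern).search(text)
print(escaped_match.span())  # (8, 11)

# Unescaped, + is a metacharacter, so this pattern does not
# match the literal text '2+2'
unescaped_match = re.compile(r'2+2').search(text)
print(unescaped_match)  # None
```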

## TODO: Find All The Metacharacters

In the cell below, we have a string that contains all the metacharacters. Write a single regular expression to check that you can match all the metacharacters using a backslash, and save the regular expression object in a variable called `regex`. Then use the `.finditer()` method to search the `sample_text` string for the given regular expression. Then, write a loop to print all the `matches` found by the `.finditer()` method. Finally, use the `match.span()` method to print the match from the `sample_text` string.

In [24]:
# Import re module
import re

# Sample text
sample_text = r'. ^ $ * + ? { } [ ] \ | ( )'

# Create a regular expression object with the regular expression 
regex = re.compile(r'\. \^ \$ \* \+ \? \{ \} \[ \] \\ \| \( \)')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

    # Using the span information from the match, print the match from the original string
    print('\nMatch from the original text:', sample_text[match.span()[0]:match.span()[1]])

<re.Match object; span=(0, 27), match='. ^ $ * + ? { } [ ] \\ | ( )'>

Match from the original text: . ^ $ * + ? { } [ ] \ | ( )


## TODO: Find The Price

In the cell below, write a regular expression that matches the price of the coat bought by John and save the regular expression object in a variable called `regex`. Then use the `.finditer()` method to search the `sample_text` string for the given regular expression. Then, write a loop to print all the `matches` found by the `.finditer()` method. Finally, use the `match.span()` method to print the match from the `sample_text` string.

In [25]:
# Import re module
import re

# Sample text
sample_text = 'John bought a winter coat for $25.99 dollars.'

# Create a regular expression object with the regular expression
regex = re.compile(r'\$25\.99')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)
    
    # Using the span information from the match, print the match from the original string
    print('\nMatch from the original text:', sample_text[match.span()[0]:match.span()[1]])

<re.Match object; span=(30, 36), match='$25.99'>

Match from the original text: $25.99


# 4. Searching For Simple Patterns

Being able to match letters and metacharacters is the simplest task that regular expressions can do. In this section we will see how we can use regular expressions to perform more complex pattern matching. We can form any pattern we want by using the metacharacters mentioned in the previous lesson.

The first metacharacter we are going to look at is the backslash (`\`). We already saw that the backslash can be used to escape all the metacharacters, so that you can search for them directly. However, the backslash can also be followed by various characters to signal various special sequences. Here is a list of the special sequences we are going to look at in this notebook:

* `\d` - Matches any decimal digit; this is equivalent to the set `[0-9]`

* `\D` - Matches any non-digit character; this is equivalent to the set `[^0-9]`

* `\s` - Matches any whitespace character; this is equivalent to the set `[ \t\n\r\f\v]`

* `\S` - Matches any non-whitespace character; this is equivalent to the set `[^ \t\n\r\f\v]`

* `\w` - Matches any alphanumeric character and the underscore; this is equivalent to the set `[a-zA-Z0-9_]`

* `\W` - Matches any character that is neither alphanumeric nor the underscore; this is equivalent to the set `[^a-zA-Z0-9_]`

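We can see that there is a difference between lowercase and uppercase sequences. For example, while `\d` matches any digit, `\D` matches everything that is **not** a digit. Similarly, while `\s` matches any whitespace character, `\S` matches everything that is **not** a whitespace character; and while `\w` matches any alphanumeric character, `\W` matches everything that is **not** an alphanumeric character. Strictly speaking, the set equivalences above describe ASCII text: with Python 3 `str` patterns these sequences also match the corresponding Unicode characters (for example, non-Latin digits and letters) unless the `re.ASCII` flag is used.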
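
To see all six sequences side by side, the following sketch applies each one to a short made-up string and collects the matched characters with `.findall()`. The names `text` and `results` are illustrative.

```python
import re

text = 'ab 12_?'

# Map each special sequence to the characters it matches in text
results = {}
for sequence in [r'\d', r'\D', r'\s', r'\S', r'\w', r'\W']:
    results[sequence] = re.compile(sequence).findall(text)
    print(sequence, results[sequence])

# Printed output:
# \d ['1', '2']
# \D ['a', 'b', ' ', '_', '?']
# \s [' ']
# \S ['a', 'b', '1', '2', '_', '?']
# \w ['a', 'b', '1', '2', '_']
# \W [' ', '?']
```

Each uppercase sequence matches exactly the characters that its lowercase partner leaves out.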

Let's start by learning how to use `\d` to search for decimal digits.

## Matching Numbers Using `\d`

In the code below, we will use `'\d'` as our regular expression to find all the decimal digits in our `sample_text` string:

In [26]:
# Import re module
import re

# Sample text
sample_text = 'Alice lives in 1230 First St., Ocean City, MD 156789.'

# Create a regular expression object with the regular expression '\d'
regex = re.compile(r'\d')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(15, 16), match='1'>
<re.Match object; span=(16, 17), match='2'>
<re.Match object; span=(17, 18), match='3'>
<re.Match object; span=(18, 19), match='0'>
<re.Match object; span=(46, 47), match='1'>
<re.Match object; span=(47, 48), match='5'>
<re.Match object; span=(48, 49), match='6'>
<re.Match object; span=(49, 50), match='7'>
<re.Match object; span=(50, 51), match='8'>
<re.Match object; span=(51, 52), match='9'>


As we can see, all the matches found above correspond to only decimal digits between 0 and 9.

Conversely, if we wanted to find all the characters that are **not** decimal digits, we would use `\D` as our regular expression, as shown below:

In [28]:
# Import re module
import re

# Sample text
sample_text = 'Alice lives in 1230 First St., Ocean City, MD 156789.'

# Create a regular expression object with the regular expression '\D'
regex = re.compile(r'\D')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(0, 1), match='A'>
<re.Match object; span=(1, 2), match='l'>
<re.Match object; span=(2, 3), match='i'>
<re.Match object; span=(3, 4), match='c'>
<re.Match object; span=(4, 5), match='e'>
<re.Match object; span=(5, 6), match=' '>
<re.Match object; span=(6, 7), match='l'>
<re.Match object; span=(7, 8), match='i'>
<re.Match object; span=(8, 9), match='v'>
<re.Match object; span=(9, 10), match='e'>
<re.Match object; span=(10, 11), match='s'>
<re.Match object; span=(11, 12), match=' '>
<re.Match object; span=(12, 13), match='i'>
<re.Match object; span=(13, 14), match='n'>
<re.Match object; span=(14, 15), match=' '>
<re.Match object; span=(19, 20), match=' '>
<re.Match object; span=(20, 21), match='F'>
<re.Match object; span=(21, 22), match='i'>
<re.Match object; span=(22, 23), match='r'>
<re.Match object; span=(23, 24), match='s'>
<re.Match object; span=(24, 25), match='t'>
<re.Match object; span=(25, 26), match=' '>
<re.Match object; span=(26, 27), match='S'>
...

We can see that none of the matches are decimal digits. We also see that by using `\D` we were able to match all the non-digit characters, including periods (`.`) and spaces.

## TODO: Find IP Addresses

In the cell below, our `sample_text` string contains three IP addresses. Write a single regular expression that matches all three IP addresses and save the regular expression object in a variable called `regex`. Then use the `.finditer()` method to search the `sample_text` string for the given regular expression. Finally, write a loop to print all the `matches` found by the `.finditer()` method.

**HINT:** Use the special sequence `\d` and take advantage of the fact that all three IP addresses have the same pattern: four groups of three digits separated by periods.

In [29]:
# Import re module
import re

# Sample text
sample_text = 'Here are three IP address: 123.456.789.123, 999.888.777.666, 111.222.333.444'

# Create a regular expression object with the regular expression
regex = re.compile(r'\d\d\d\.\d\d\d\.\d\d\d\.\d\d\d')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(27, 42), match='123.456.789.123'>
<re.Match object; span=(44, 59), match='999.888.777.666'>
<re.Match object; span=(61, 76), match='111.222.333.444'>


If you wrote your regex correctly you should see three matches above corresponding to the three IP addresses in our `sample_text` string.
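
Note that a pattern built from `\d\d\d` groups only fits addresses whose four groups each have exactly three digits. That is true of the sample text, but not of IP addresses in general. The sketch below, using two made-up addresses, illustrates the limitation:

```python
import re

regex = re.compile(r'\d\d\d\.\d\d\d\.\d\d\d\.\d\d\d')

# Every group has three digits, so the pattern matches
long_match = regex.search('192.168.100.200')
print(long_match is not None)  # True

# Groups with fewer digits are not matched by this pattern
short_match = regex.search('10.0.0.1')
print(short_match)  # None
```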

## Matching Whitespace Characters Using `\s`

In the code below, we will use `\s` as our regular expression to find all the whitespace characters in our `sample_text` string. For this example, we will use a string literal that spans multiple lines. To create this multi-line string, we will use triple-quotes (`'''`) both at the beginning and at the end of the multi-line string.

In [33]:
# Import re module
import re

# Sample text
sample_text = '''
\tAlice lives in:\f
1230 First St.\r
Ocean City, MD 156789.\v
'''

# Create a regular expression object with the regular expression '\s'
regex = re.compile(r'\s')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(0, 1), match='\n'>
<re.Match object; span=(1, 2), match='\t'>
<re.Match object; span=(7, 8), match=' '>
<re.Match object; span=(13, 14), match=' '>
<re.Match object; span=(17, 18), match='\x0c'>
<re.Match object; span=(18, 19), match='\n'>
<re.Match object; span=(23, 24), match=' '>
<re.Match object; span=(29, 30), match=' '>
<re.Match object; span=(33, 34), match='\r'>
<re.Match object; span=(34, 35), match='\n'>
<re.Match object; span=(40, 41), match=' '>
<re.Match object; span=(46, 47), match=' '>
<re.Match object; span=(49, 50), match=' '>
<re.Match object; span=(57, 58), match='\x0b'>
<re.Match object; span=(58, 59), match='\n'>


As we can see, all the matches found correspond to spaces, tabs (`\t`), newlines (`\n`), carriage returns (`\r`), form feeds (`\f`), and vertical tabs (`\v`). Notice that form feeds appear as `\x0c` and vertical tabs as `\x0b`.

Conversely, if we wanted to find all the characters that are **not** whitespace characters, we would use `\S` as our regular expression, as shown below:

In [34]:
# Import re module
import re

# Sample text
sample_text = '''
\tAlice lives in:\f
1230 First St.\r
Ocean City, MD 156789.\v
'''

# Create a regular expression object with the regular expression '\S'
regex = re.compile(r'\S')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(2, 3), match='A'>
<re.Match object; span=(3, 4), match='l'>
<re.Match object; span=(4, 5), match='i'>
<re.Match object; span=(5, 6), match='c'>
<re.Match object; span=(6, 7), match='e'>
<re.Match object; span=(8, 9), match='l'>
<re.Match object; span=(9, 10), match='i'>
<re.Match object; span=(10, 11), match='v'>
<re.Match object; span=(11, 12), match='e'>
<re.Match object; span=(12, 13), match='s'>
<re.Match object; span=(14, 15), match='i'>
<re.Match object; span=(15, 16), match='n'>
<re.Match object; span=(16, 17), match=':'>
<re.Match object; span=(19, 20), match='1'>
<re.Match object; span=(20, 21), match='2'>
<re.Match object; span=(21, 22), match='3'>
<re.Match object; span=(22, 23), match='0'>
<re.Match object; span=(24, 25), match='F'>
<re.Match object; span=(25, 26), match='i'>
<re.Match object; span=(26, 27), match='r'>
<re.Match object; span=(27, 28), match='s'>
<re.Match object; span=(28, 29), match='t'>
<re.Match object; span=(30, 31), match='S'>
...

We can see that none of the matches above are whitespace characters. We also see that by using `\S` we were able to match all the non-whitespace characters, including punctuation such as the colon (`:`), letters, and numbers.

## TODO: Print The Numbers Between Whitespace Characters

In the cell below, our `sample_text` consists of a multi-line string with numbers in between whitespace characters:

```python
123	45	7895
1	222	33
```

Notice that not all the numbers have the same number of digits. For example, the first number (`123`) has three digits, but the second number (`45`) only has two digits.

Write a single regular expression that finds the tabs (`\t`) and the newlines (`\n`) in this multi-line string and save the regular expression object in a variable called `regex`. Then use the `.finditer()` method to search the `sample_text` string for the given regular expression. Then, write a loop that uses the span information from each `match` to only print the numbers found in the original multi-line string. Your code should work in the general case where the numbers can have any number of digits. For example, if the numbers in the string were to change, your code should still be able to find them and print them. Finally, in this exercise you cannot use `\d` in your regular expression.

**HINT:** Notice that there are no space characters in the multi-line string, only tabs and newlines. Use the `\s` sequence to find the tabs and newlines. Then notice that each number lies between the `end` index of one match and the `start` index of the next match, whatever its number of digits. Use these indices to print the numbers found in the original multi-line string. You can use the `match.span()` method we saw before to find the `start` and `end` indices of each `match`. Alternatively, you can also use the `.start()` and `.end()` methods to extract the `start` and `end` indices of each match. The `match.start()` is equivalent to `match.span()[0]` and `match.end()` is equivalent to `match.span()[1]`.

In [35]:
# Import re module
import re

# Sample text
sample_text = '''
123\t45\t7895
1\t222\t33
'''

# Print sample_text
print('Sample Text:\n', sample_text)

# Create a regular expression object with the regular expression
regex = re.compile(r'\s')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Set counter
counter = 0

# Write a loop to print all the numbers found in the original string
for match in matches:    
    if counter != 0:
        start_idx = match.start()        
        print('\nNumbers from the original text:', sample_text[end_idx:start_idx])        
    end_idx = match.end()
    counter += 1

Sample Text:
 
123	45	7895
1	222	33


Numbers from the original text: 123

Numbers from the original text: 45

Numbers from the original text: 7895

Numbers from the original text: 1

Numbers from the original text: 222

Numbers from the original text: 33
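
For comparison only: outside the constraints of this exercise, the same numbers could be obtained with the standard `re.split()` function, which splits a string wherever the pattern matches. This is a sketch of an alternative, not the approach the exercise asks for; the names `pieces` and `numbers` are illustrative.

```python
import re

sample_text = '''
123\t45\t7895
1\t222\t33
'''

# Split on every whitespace character; the leading and trailing
# newlines produce empty strings, which are filtered out
pieces = re.split(r'\s', sample_text)
numbers = [piece for piece in pieces if piece]
print(numbers)  # ['123', '45', '7895', '1', '222', '33']
```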


## Matching Alphanumeric Characters Using `\w`

In the code below, we will use `\w` as our regular expression to find all the alphanumeric characters in our `sample_text` string. This includes the underscore ( `_` ), all the numbers from 0 through 9, and all the uppercase and lowercase letters:

In [36]:
# Import re module
import re

# Sample text
sample_text = '''
You can contact FAKE Company at:
fake_company12@email.com.
'''

# Create a regular expression object with the regular expression '\w'
regex = re.compile(r'\w')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(1, 2), match='Y'>
<re.Match object; span=(2, 3), match='o'>
<re.Match object; span=(3, 4), match='u'>
<re.Match object; span=(5, 6), match='c'>
<re.Match object; span=(6, 7), match='a'>
<re.Match object; span=(7, 8), match='n'>
<re.Match object; span=(9, 10), match='c'>
<re.Match object; span=(10, 11), match='o'>
<re.Match object; span=(11, 12), match='n'>
<re.Match object; span=(12, 13), match='t'>
<re.Match object; span=(13, 14), match='a'>
<re.Match object; span=(14, 15), match='c'>
<re.Match object; span=(15, 16), match='t'>
<re.Match object; span=(17, 18), match='F'>
<re.Match object; span=(18, 19), match='A'>
<re.Match object; span=(19, 20), match='K'>
<re.Match object; span=(20, 21), match='E'>
<re.Match object; span=(22, 23), match='C'>
<re.Match object; span=(23, 24), match='o'>
<re.Match object; span=(24, 25), match='m'>
<re.Match object; span=(25, 26), match='p'>
<re.Match object; span=(26, 27), match='a'>
<re.Match object; span=(27, 28), match='n'>
...

As we can see, all the matches found correspond to alphanumeric characters only, including the underscore in the email address.

Conversely, if we wanted to find all the characters that are **not** alphanumeric characters, we would use `\W` as our regular expression, as shown below:

In [37]:
# Import re module
import re

# Sample text
sample_text = '''
You can contact FAKE Company at:
fake_company12@email.com.
'''

# Create a regular expression object with the regular expression '\W'
regex = re.compile(r'\W')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(0, 1), match='\n'>
<re.Match object; span=(4, 5), match=' '>
<re.Match object; span=(8, 9), match=' '>
<re.Match object; span=(16, 17), match=' '>
<re.Match object; span=(21, 22), match=' '>
<re.Match object; span=(29, 30), match=' '>
<re.Match object; span=(32, 33), match=':'>
<re.Match object; span=(33, 34), match='\n'>
<re.Match object; span=(48, 49), match='@'>
<re.Match object; span=(54, 55), match='.'>
<re.Match object; span=(58, 59), match='.'>
<re.Match object; span=(59, 60), match='\n'>


We can see that none of the matches are alphanumeric characters. We also see that by using `\W` we were able to match all the whitespace characters, as well as the colon (`:`), the periods (`.`), and the `@` symbol in the email address.

## TODO: Find Emails

In the cell below, our `sample_text` consists of a multi-line string that contains three email addresses:

```
j.s@email.com
a.w@email.com
m.j@email.com
```

Notice that all three email addresses have the same pattern, namely, the first-name initial, followed by a dot (`.`), followed by the last-name initial, and ending in `@email.com`.

Take advantage of the fact that all three email addresses have the same pattern to write a single regular expression that can find all three email addresses in our `sample_text` string. As usual, save the regular expression object in a variable called `regex`. Then use the `.finditer()` method to search the `sample_text` string for the given regular expression. Finally, write a loop to print all the `matches` found by the `.finditer()` method.

In [38]:
# Import re module
import re

# Sample text
sample_text = '''
John Sanders: j.s@email.com
Alice Walters: a.w@email.com
Mary Jones: m.j@email.com
'''

# Print sample_text
print('Sample Text:\n', sample_text)

# Create a regular expression object with the regular expression
regex = re.compile(r'\w\.\w@email\.com')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

Sample Text:
 
John Sanders: j.s@email.com
Alice Walters: a.w@email.com
Mary Jones: m.j@email.com

<re.Match object; span=(15, 28), match='j.s@email.com'>
<re.Match object; span=(44, 57), match='a.w@email.com'>
<re.Match object; span=(70, 83), match='m.j@email.com'>
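
One detail worth keeping in mind when writing a pattern like this: an unescaped dot matches any character, so the dot in `email.com` needs a backslash if only a literal dot should be accepted. The sketch below illustrates the difference on a made-up string (the second address, with an underscore, is a hypothetical example and not part of the exercise). It uses the `.findall()` method, which returns the matched strings as a list.

```python
import re

# Hypothetical text: the second address has "_" where a dot is expected
text = 'j.s@email.com j.s@email_com'

# The final "." is unescaped, so it matches any character
unescaped = re.compile(r'\w\.\w@email.com')

# The final "." is escaped, so it only matches a literal dot
escaped = re.compile(r'\w\.\w@email\.com')

unescaped_matches = unescaped.findall(text)
escaped_matches = escaped.findall(text)

print(unescaped_matches)  # ['j.s@email.com', 'j.s@email_com']
print(escaped_matches)    # ['j.s@email.com']
```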


# 5. Word Boundaries

We will now learn about another special sequence that you can create using the backslash:

* `\b`

This special sequence doesn't match any characters. Instead, it matches a position: a word boundary. A word in this context is a sequence of alphanumeric characters (the characters matched by `\w`, which include the underscore). A boundary is the position between a word character and anything that is not one: a white space, a non-alphanumeric character, or the beginning or end of the string. We can have boundaries either before or after a word. Let's see how this works with an example.

In the code below, our `sample_text` string contains the following sentence:

```
The biology class will meet in the first floor classroom to learn about Theria, a subclass of mammals.
```

As we can see, the word `class` appears in three different positions:

1. As a stand-alone word: The word `class` has white spaces both before and after it.
2. At the beginning of a word: The word `class` in `classroom` has a white space before it.
3. At the end of a word: The word `class` in `subclass` has a white space after it.

If we use `class` as our regular expression, we will match the word `class` in all three positions as shown in the code below:

In [39]:
# Import re module
import re

# Sample text
sample_text = 'The biology class will meet in the first floor classroom to learn about Theria, a subclass of mammals.'

# Create a regular expression object with the regular expression 'class'
regex = re.compile(r'class')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(12, 17), match='class'>
<re.Match object; span=(47, 52), match='class'>
<re.Match object; span=(85, 90), match='class'>


We can see that we have three matches, corresponding to all the instances of the word `class` in our `sample_text` string.

Now, let's use word boundaries to only find the word `class` when it appears in particular positions. Let’s start by using `\b` to only find the word `class` when it appears at the beginning of a word. We can do this by adding `\b` before the word `class` in our regular expression as shown below:

In [40]:
# Import re module
import re

# Sample text
sample_text = 'The biology class will meet in the first floor classroom to learn about Theria, a subclass of mammals.'

# Create a regular expression object with the regular expression '\bclass'
regex = re.compile(r'\bclass')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(12, 17), match='class'>
<re.Match object; span=(47, 52), match='class'>


We can see that now we only have two matches because it's only matching the stand-alone word, `class`, and the `class` in `classroom` since both of them have a word boundary (in this case a white space) directly before them. We can also see that it is not matching the `class` in `subclass` because there is no word boundary directly before it. 

Now, let's use `\b` to only find the word `class` when it appears at the end of a word. We can do this by adding `\b` after the word `class` in our regular expression as shown below:

In [41]:
# Import re module
import re

# Sample text
sample_text = 'The biology class will meet in the first floor classroom to learn about Theria, a subclass of mammals.'

# Create a regular expression object with the regular expression 'class\b'
regex = re.compile(r'class\b')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(12, 17), match='class'>
<re.Match object; span=(85, 90), match='class'>


We can see that in this case we have two matches as well because it's matching the stand-alone word, `class` again, and the `class` in `subclass` since both of them have a word boundary (in this case a white space) directly after them. We can also see that it is not matching the `class` in `classroom` because there is no word boundary directly after it.

Now, let's use `\b` to only find the word `class` when it appears as a stand-alone word. We can do this by adding `\b` both before and after the word `class` in our regular expression as shown below:

In [42]:
# Import re module
import re

# Sample text
sample_text = 'The biology class will meet in the first floor classroom to learn about Theria, a subclass of mammals.'

# Create a regular expression object with the regular expression '\bclass\b'
regex = re.compile(r'\bclass\b')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(12, 17), match='class'>


We can see that now we only have one match because the stand-alone word, `class`, is the only one that has a word boundary (in this case a white space) directly before and after it.
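
Since `\b` matches a position and not a character, each of its matches is an empty string. The short sketch below, on a made-up string, lists the positions where `\b` matches, and then shows that an underscore counts as a word character (so no boundary exists between `class` and `_`). The sample strings here are illustrative assumptions, not part of the lesson's data.

```python
import re

# Hypothetical sample text
sample_text = 'a subclass of'

# Every match of \b is zero characters wide, so we record only where it starts
regex = re.compile(r'\b')
boundary_positions = [match.start() for match in regex.finditer(sample_text)]
print(boundary_positions)  # [0, 1, 2, 10, 11, 13]

# The underscore is a word character, so 'class' in 'class_name' has no
# boundary directly after it and only the stand-alone 'class' is matched
underscore_matches = re.compile(r'\bclass\b').findall('class_name class')
print(underscore_matches)  # ['class']
```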

## TODO: Find All 3-Letter Words

In the cell below, write a regular expression that can match all 3-letter words in the `sample_text` string. As usual, save the regular expression object in a variable called `regex`. Then use the `.finditer()` method to search the `sample_text` string for the given regular expression. Finally, write a loop to print all the `matches` found by the `.finditer()` method.

In [43]:
# Import re module
import re

# Sample text
sample_text = 'John went to the store in his car, but forgot to buy bread.'

# Create a regular expression object with the regular expression
regex = re.compile(r'\b\w\w\w\b')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(13, 16), match='the'>
<re.Match object; span=(26, 29), match='his'>
<re.Match object; span=(30, 33), match='car'>
<re.Match object; span=(35, 38), match='but'>
<re.Match object; span=(49, 52), match='buy'>


## Not A Word Boundary

As with the other special sequences that we saw before, we also have the uppercase version of `\b`, namely:

* `\B`

As with the other special sequences, `\B` indicates the opposite of `\b`. So if `\b` is used to indicate a word boundary, `\B` is used to indicate **not** a word boundary. Let's see how this works:

Let's use `\B` to only find the word `class` when it **doesn't** have a word boundary directly before it. We can do this by adding `\B` before the word `class` in our regular expression as shown below:

In [44]:
# Import re module
import re

# Sample text
sample_text = 'The biology class will meet in the first floor classroom to learn about Theria, a subclass of mammals.'

# Create a regular expression object with the regular expression '\Bclass'
regex = re.compile(r'\Bclass')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(85, 90), match='class'>


We can see that we only get one match because the `class` in `subclass` is the only one that **doesn't** have a word boundary directly before it. 

Now, let's use `\B` to only find the word `class` when it **doesn't** have a word boundary directly after it. We can do this by adding `\B` after the word `class` in our regular expression as shown below:

In [45]:
# Import re module
import re

# Sample text
sample_text = 'The biology class will meet in the first floor classroom to learn about Theria, a subclass of mammals.'

# Create a regular expression object with the regular expression 'class\B'
regex = re.compile(r'class\B')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(47, 52), match='class'>


We can see that again we only have one match because the `class` in `classroom` is the only one that **doesn't** have a boundary directly after it. 

Finally, let's use `\B` to only find the word `class` when it **doesn't** have a word boundary directly before or after it. We can do this by adding `\B` both before and after the word `class` in our regular expression as shown below:

In [46]:
# Import re module
import re

# Sample text
sample_text = 'The biology class will meet in the first floor classroom to learn about Theria, a subclass of mammals.'

# Create a regular expression object with the regular expression '\Bclass\B'
regex = re.compile(r'\Bclass\B')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

In this case, we can see that we get no matches. This is because every instance of the word `class` in our `sample_text` string has a boundary either before or after it. To get a match in this case, `class` has to appear in the middle of a word, such as in the word `declassified`. Let's see an example:

In [47]:
# Import re module
import re

# Sample text
sample_text = 'declassified'

# Create a regular expression object with the regular expression '\Bclass\B'
regex = re.compile(r'\Bclass\B')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(2, 7), match='class'>
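
The `\b` and `\B` experiments on the biology sentence can also be compared side by side. The sketch below loops over the six patterns used so far and collects the span of every match; storing the spans in a dictionary is just a convenience chosen here, not something the lesson requires.

```python
import re

sample_text = 'The biology class will meet in the first floor classroom to learn about Theria, a subclass of mammals.'

# The six boundary patterns used in this lesson
patterns = [r'\bclass\b', r'\bclass', r'class\b', r'\Bclass', r'class\B', r'\Bclass\B']

# Map each pattern to the list of (start, end) spans it matches
spans_by_pattern = {}
for pattern in patterns:
    regex = re.compile(pattern)
    spans_by_pattern[pattern] = [match.span() for match in regex.finditer(sample_text)]
    print(pattern, spans_by_pattern[pattern])
```

Each list of spans agrees with the output of the corresponding cell above, and the last pattern yields an empty list.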


## TODO: Finding Last Digits

In the cell below, our `sample_text` string contains some numbers separated by whitespace characters.

Write code that uses a regular expression to count how many numbers greater than 3 have 3 as their last digit. For example, 93 is greater than 3 and its last digit is 3, so your code should count this number as a match. However, the number 3 by itself should not be counted as a match.

As usual, save the regular expression object in a variable called `regex`. Then use the `.finditer()` method to search the `sample_text` string for the given regular expression. Then, write a loop to print all the `matches` found by the `.finditer()` method. Finally, print the total number of matches.

In [49]:
# Import re module
import re

# Sample text
sample_text = '203 3 403 687 283 234 983 345 23 3 74 978'

# Create a regular expression object with the regular expression
regex = re.compile(r'\B3\b')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Set counter
num_matches = 0

# Print all the matches
for match in matches:
    print(match)
    num_matches += 1
    
# Print the total number of matches    
print('\nTotal Number of Matches:', num_matches)

<re.Match object; span=(2, 3), match='3'>
<re.Match object; span=(8, 9), match='3'>
<re.Match object; span=(16, 17), match='3'>
<re.Match object; span=(24, 25), match='3'>
<re.Match object; span=(31, 32), match='3'>

Total Number of Matches: 5


If you wrote your code correctly you should get a total of 5 matches.
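
If only the count is needed, the loop and counter can be replaced by a single call. The sketch below is one possible alternative, assuming the `\B3\b` pattern: it uses `.findall()`, which returns all matched strings as a list, and takes the length of that list.

```python
import re

sample_text = '203 3 403 687 283 234 983 345 23 3 74 978'

# \B requires a word character before the 3 (so a lone 3 is skipped);
# \b requires the 3 to be the last character of the number
regex = re.compile(r'\B3\b')

# findall() returns one list entry per match, so its length is the count
num_matches = len(regex.findall(sample_text))
print('Total Number of Matches:', num_matches)  # Total Number of Matches: 5
```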

# 6. Simple MetaCharacters

As we indicated in a previous lesson, regular expressions use metacharacters to give special instructions. Here again is a complete list of all the metacharacters used in regular expressions:

```python
. ^ $ * + ? { } [ ] \ | ( )
```
We already learned how to use one of these metacharacters, the backslash (`\`), to create special sequences. In the following lessons we will learn how to use the remaining metacharacters to create more complicated regular expressions. 

In this notebook, we will take a look at the following metacharacters:

```python
. ^ $
```

Let’s start by looking at the dot (`.`) metacharacter.

## The Dot (`.`)

As we saw in a previous lesson, the dot (`.`) matches any character except for newline (`\n`) characters. In the code below, we will use `.` as our regular expression to find all the characters in our multi-line `sample_text` string:

In [50]:
# Import re module
import re

# Sample text
sample_text = '''
\tAlice lives in:\f
1230 First St.\r
Ocean City, MD 156789.\v
'''

# Create a regular expression object with the regular expression '.'
regex = re.compile(r'.')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(1, 2), match='\t'>
<re.Match object; span=(2, 3), match='A'>
<re.Match object; span=(3, 4), match='l'>
<re.Match object; span=(4, 5), match='i'>
<re.Match object; span=(5, 6), match='c'>
<re.Match object; span=(6, 7), match='e'>
<re.Match object; span=(7, 8), match=' '>
<re.Match object; span=(8, 9), match='l'>
<re.Match object; span=(9, 10), match='i'>
<re.Match object; span=(10, 11), match='v'>
<re.Match object; span=(11, 12), match='e'>
<re.Match object; span=(12, 13), match='s'>
<re.Match object; span=(13, 14), match=' '>
<re.Match object; span=(14, 15), match='i'>
<re.Match object; span=(15, 16), match='n'>
<re.Match object; span=(16, 17), match=':'>
<re.Match object; span=(17, 18), match='\x0c'>
<re.Match object; span=(19, 20), match='1'>
<re.Match object; span=(20, 21), match='2'>
<re.Match object; span=(21, 22), match='3'>
<re.Match object; span=(22, 23), match='0'>
<re.Match object; span=(23, 24), match=' '>
<re.Match object; span=(24, 25), match='F'>
...

As we can see, we were able to match all the characters in our `sample_text` string, except for newline characters.
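
If newline characters should be matched as well, the `re` module provides the `re.DOTALL` flag, which can be passed as the second argument to `re.compile()`. The sketch below compares the two behaviors on a short made-up string.

```python
import re

# Hypothetical two-line string
sample_text = 'ab\ncd'

# By default the dot skips the newline character
default_matches = re.compile(r'.').findall(sample_text)

# With re.DOTALL the dot matches the newline character too
dotall_matches = re.compile(r'.', re.DOTALL).findall(sample_text)

print(default_matches)  # ['a', 'b', 'c', 'd']
print(dotall_matches)   # ['a', 'b', '\n', 'c', 'd']
```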

## The Caret (`^`)

The caret (`^`) is used to match a sequence of characters when they appear at the beginning of a string. Let's take a look at an example.

In the code below, our `sample_text` string has the word `this` written twice:

```
this watch belongs in this box.
```

As we can see, the first instance of the word `this` occurs at the beginning of the string, while the second instance occurs towards the end of the string.

If we use `this` as our regular expression, we will match both instances of the word as shown in the code below:

In [51]:
# Import re module
import re

# Sample text
sample_text = 'this watch belongs in this box.'

# Create a regular expression object with the regular expression 'this'
regex = re.compile(r'this')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(0, 4), match='this'>
<re.Match object; span=(22, 26), match='this'>


We can clearly see that we get two matches that correspond to both instances of the word `this` in our `sample_text` string.

Now, let's use the caret to only find the word `this` that appears at the beginning of the string. We can do this by adding the caret (`^`) before the word `this` in our regular expression as shown below:

In [52]:
# Import re module
import re

# Sample text
sample_text = 'this watch belongs in this box.'

# Create a regular expression object with the regular expression '^this'
regex = re.compile(r'^this')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(0, 4), match='this'>


We can see that now we only get one match, corresponding to the word `this` that appears at the beginning of the string. It didn't match the second instance of the word `this` because it wasn't at the beginning of our `sample_text` string.
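
By default the caret only looks at the very beginning of the whole string, even when the string has several lines. If a match at the beginning of every line is wanted, the `re.MULTILINE` flag can be passed to `re.compile()`. The sketch below shows the difference on a made-up two-line string.

```python
import re

# Hypothetical two-line string; each line starts with 'this'
sample_text = 'this watch\nthis box'

# Default: ^ matches only at the beginning of the string
default_regex = re.compile(r'^this')
default_spans = [match.span() for match in default_regex.finditer(sample_text)]

# re.MULTILINE: ^ also matches right after each newline
multiline_regex = re.compile(r'^this', re.MULTILINE)
multiline_spans = [match.span() for match in multiline_regex.finditer(sample_text)]

print(default_spans)    # [(0, 4)]
print(multiline_spans)  # [(0, 4), (11, 15)]
```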

## The Dollar Sign (`$`)

The dollar sign (`$`) is used to match a sequence of characters when they appear at the end of a string. Let's take a look at an example.

In the code below, our `sample_text` string has the word `watch` written twice:

```
this watch is better than this watch
```

As we can see, the first instance of the word `watch` occurs towards the beginning of the string, while the second instance occurs at the end of the string.

If we use `watch` as our regular expression, we will match both instances of the word as shown in the code below:

In [54]:
# Import re module
import re

# Sample text
sample_text = 'this watch is better than this watch'

# Create a regular expression object with the regular expression 'watch'
regex = re.compile(r'watch')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(5, 10), match='watch'>
<re.Match object; span=(31, 36), match='watch'>


We can clearly see that we get two matches that correspond to both instances of the word `watch` in our `sample_text` string.

Now, let's use the dollar sign to only find the word `watch` that appears at the end of the string. We can do this by adding the dollar sign (`$`) after the word `watch` in our regular expression as shown below:

In [55]:
# Import re module
import re

# Sample text
sample_text = 'this watch is better than this watch'

# Create a regular expression object with the regular expression 'watch$'
regex = re.compile(r'watch$')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(31, 36), match='watch'>


We can see that now we only get one match, corresponding to the word `watch` that appears at the end of the string. It didn't match the first instance of the word `watch` because it wasn't at the end of our `sample_text` string.
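
One subtlety of the dollar sign in Python is that it also matches just before a newline that ends the string. When the match must sit at the absolute end, the special sequence `\Z` can be used instead. The sketch below, on a made-up string with a trailing newline, shows both behaviors using the `.search()` method, which returns the first match or `None`.

```python
import re

# Hypothetical string that ends with a newline character
sample_text = 'this watch\n'

# $ matches at the end of the string OR just before a final newline
dollar_match = re.compile(r'watch$').search(sample_text)

# \Z matches only at the very end of the string
strict_match = re.compile(r'watch\Z').search(sample_text)

print(dollar_match)  # <re.Match object; span=(5, 10), match='watch'>
print(strict_match)  # None
```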

# 7. Character Sets

In this lesson, we will continue to look at metacharacters. In particular, we will learn how to look for phone numbers by employing the following metacharacters:

```python
{} []
```

## Finding Phone Numbers

In the code below, our `sample_text` consists of a multi-line string that mimics a phone book:

```
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
```

We can notice that even though all the phone numbers have different digits, they all have the same pattern, namely, 3 digits followed by a single character, followed by 3 more digits, followed by another single character, followed by 4 digits. We will take advantage of this pattern to create a regular expression that can match all these phone numbers. To do this, we will use the special sequence `\d` and the dot (`.`) in our regular expression, as shown in the code below:

In [56]:
# Import re module
import re

# Sample text
sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''

# Create a regular expression object with a regular expression that can match all the
# phone numbers in our sample_text
regex = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(12, 24), match='555-123-4567'>
<re.Match object; span=(37, 49), match='455 555 4549'>
<re.Match object; span=(63, 75), match='655-777-7346'>
<re.Match object; span=(89, 101), match='555)999-8464'>


We can see that we managed to find all the phone numbers in our multi-line string, even though they all have different digits and different characters between the groups of numbers. Notice that by using the dot we were able to match the dash (`-`), the white space (` `), and the parenthesis (`)`) separating the groups of numbers. By using the dot we avoid having to create three different regular expressions to match the three possible characters separating the groups of numbers.

We can now write the above regular expression in a more compact form by using the `{ }` metacharacters. The sequence `{m}` specifies that exactly `m` copies of the previous regular expression should be matched. For example, the sequence `\d{3}` specifies that exactly `3` copies of the `\d` regular expression should be matched. Therefore, the sequence `\d{3}` is equivalent to the sequence `\d\d\d`.

Consequently, we can employ the `{}` metacharacters to write the previous code in a more compact form, as shown below:

In [57]:
# Import re module
import re

# Sample text
sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''

# Create a regular expression object with a regular expression that can match all the
# phone numbers in our sample_text using the {} metacharacters
regex = re.compile(r'\d{3}.\d{3}.\d{4}')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(12, 24), match='555-123-4567'>
<re.Match object; span=(37, 49), match='455 555 4549'>
<re.Match object; span=(63, 75), match='655-777-7346'>
<re.Match object; span=(89, 101), match='555)999-8464'>


As we can see, we get the same result as before.
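
The equivalence of the two forms can be checked directly. The sketch below compares the spans found by both patterns on a single line of the phone book, and then illustrates one point that is easy to miss: `\d{3}` consumes exactly three digits per match, but it does not prevent those digits from being part of a longer run. The string `'1234567'` is a made-up example.

```python
import re

sample_text = 'Mr. Brown: 555-123-4567'

# Long form and compact form of the same regular expression
long_form = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')
short_form = re.compile(r'\d{3}.\d{3}.\d{4}')

long_spans = [match.span() for match in long_form.finditer(sample_text)]
short_spans = [match.span() for match in short_form.finditer(sample_text)]
print(long_spans == short_spans)  # True

# Each match takes exactly three digits; the leftover '7' is not matched
digit_groups = re.compile(r'\d{3}').findall('1234567')
print(digit_groups)  # ['123', '456']
```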

## Finding Phone Numbers With Specific Separators

Now let's suppose we only wanted to find phone numbers in which the groups of digits were separated by either a dash (`-`) or a white space (` `). In this case we can use what is known as a **Character Set**. Character sets are specified using the `[]` metacharacters and are used to indicate a set of characters that you wish to match. Let’s see an example.

In the code below, we employ the character set `[- ]` (notice that there is a white space after the dash) in our regular expression to only match phone numbers whose groups of numbers are separated by either a dash (`-`) or a white space (` `):

In [58]:
# Import re module
import re

# Sample text
sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''

# Create a regular expression object with a regular expression that can match all the
# phone numbers that have either a dash or a white space as a separator
regex = re.compile(r'\d{3}[- ]\d{3}[- ]\d{4}')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(12, 24), match='555-123-4567'>
<re.Match object; span=(37, 49), match='455 555 4549'>
<re.Match object; span=(63, 75), match='655-777-7346'>


We can clearly see that now we only match the phone numbers that have either a dash (`-`) or a white space (` `) as a separator. Notice that the last phone number is not matched: even though its last two groups of digits are separated by a dash (`-`), its first two groups are separated by a parenthesis (`)`), which is not in our character set.
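
A character set is also stricter than the dot in a way the phone book does not reveal: the dot accepts letters and even digits as separators. The sketch below compares both patterns on a made-up string containing one well-formed number, one number separated by the letter `x`, and one run of twelve digits.

```python
import re

# Hypothetical text: only the first entry uses dashes as separators
sample_text = '555-123-4567 555x123x4567 555012340567'

# The dot accepts any character (except a newline) as a separator
dot_regex = re.compile(r'\d{3}.\d{3}.\d{4}')

# The character set accepts only a dash or a white space
set_regex = re.compile(r'\d{3}[- ]\d{3}[- ]\d{4}')

dot_matches = dot_regex.findall(sample_text)
set_matches = set_regex.findall(sample_text)

print(dot_matches)  # ['555-123-4567', '555x123x4567', '555012340567']
print(set_matches)  # ['555-123-4567']
```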

It is important to note that even though a character set can have many characters, it only matches one of those characters at a time. For example, suppose we add a white space after each dash in Mr. Brown's phone number, as shown below:

In [59]:
# Import re module
import re

# Sample text
sample_text = '''
Mr. Brown: 555- 123- 4567
'''

# Create a regular expression object with a regular expression that can match all the
# phone numbers that have either a dash or a white space as a separator
regex = re.compile(r'\d{3}[- ]\d{3}[- ]\d{4}')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

We can see that now we get no matches. This is because the character set `[- ]` used in our regular expression matches only one character at a time. In other words, in order to get a match there must be either a dash **or** a white space separating the groups of numbers, but **not** both.
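
If separators made of more than one character should be accepted, a repetition can be applied to the character set itself. The sketch below is one possible approach, assuming the `{m,n}` form of the `{ }` metacharacters (at least `m` and at most `n` repetitions), to allow one or two separator characters.

```python
import re

sample_text = 'Mr. Brown: 555- 123- 4567'

# Exactly one separator character between the groups of digits
single = re.compile(r'\d{3}[- ]\d{3}[- ]\d{4}')

# One or two separator characters, each being a dash or a white space
flexible = re.compile(r'\d{3}[- ]{1,2}\d{3}[- ]{1,2}\d{4}')

single_matches = single.findall(sample_text)
flexible_matches = flexible.findall(sample_text)

print(single_matches)    # []
print(flexible_matches)  # ['555- 123- 4567']
```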

## Finding Phone Numbers With Specific Separators and Area Codes

Let's see another example of a character set. Suppose we only want to find phone numbers in which the groups of digits are separated by either a dash or a white space, and that have area code `455` or `655`. Notice that all the area codes in our `sample_text` end in `55`:

```
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
```

Therefore, to find all the phone numbers that have area code `455` or `655`, we only need to indicate that the first digit in the area code must be either a `4` or a `6`.

To do this, we can use the character set `[46]` in our regular expression to indicate that the first number should be either a `4` or a `6`, as shown in the code below:

In [60]:
# Import re module
import re

# Sample text
sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''

# Create a regular expression object with a regular expression that can match all the
# phone numbers that have either a dash or a white space as a separator and have area
# code 455 or 655
regex = re.compile(r'[46]55[- ]\d{3}[- ]\d{4}')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(37, 49), match='455 555 4549'>
<re.Match object; span=(63, 75), match='655-777-7346'>


We can see that we only get the two phone numbers that have area code `455` or `655` and that have either a dash or a white space as a separator.

## Finding Phone Numbers With Specific Last Digits

Now let's suppose we want to look for phone numbers that end in the digits `6`, `7`, `8`, or `9`. In this case, we could use the character set `[6789]`. However, there is a more compact way of doing this. **Within** a character set, when a dash (`-`) is placed **between** digits or letters, it specifies a range. For example, the character set `[6-9]` is equivalent to the character set `[6789]`, and the character set `[a-f]` is equivalent to the character set `[abcdef]`. It is important to note that when a dash is placed at the **beginning** of a character set, as we did in the previous examples, the dash is taken **literally**. Let’s see how this works.

In the code below, we will use the character set `[6-9]` in our regular expression to find all the phone numbers that end in the digits `6`, `7`, `8`, or `9`:

In [61]:
# Import re module
import re

# Sample text
sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''

# Create a regular expression object with a regular expression that can match all the
# phone numbers that end in the digits 6, 7, 8, or 9.
regex = re.compile(r'\d{3}.\d{3}.\d{3}[6-9]')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(12, 24), match='555-123-4567'>
<re.Match object; span=(37, 49), match='455 555 4549'>
<re.Match object; span=(63, 75), match='655-777-7346'>


As we can see, we get all the phone numbers that end in the digits `6`, `7`, `8`, or `9`. Notice that the last phone number is not matched because its last digit is a `4`.
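
The two roles of the dash inside a character set can be seen on a small made-up string. In the sketch below, `[a-c]` treats the dash as a range, while `[-ac]` and `[ac-]` (dash at the beginning or at the end of the set) treat it as a literal character.

```python
import re

# Hypothetical string mixing letters and dashes
sample_text = 'a-b-c-d'

# Dash between two letters: a range from 'a' to 'c'
range_matches = re.compile(r'[a-c]').findall(sample_text)

# Dash at the beginning of the set: a literal dash, plus 'a' and 'c'
leading_dash_matches = re.compile(r'[-ac]').findall(sample_text)

# Dash at the end of the set is also taken literally
trailing_dash_matches = re.compile(r'[ac-]').findall(sample_text)

print(range_matches)          # ['a', 'b', 'c']
print(leading_dash_matches)   # ['a', '-', '-', 'c', '-']
print(trailing_dash_matches)  # ['a', '-', '-', 'c', '-']
```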

Now let's suppose we want to find the phone numbers that **do not** end in the digits `6`, `7`, `8`, or `9`. In this case we could use the character set `[0-5]`. However, we could also use the character set `[^6-9]` (notice the caret (`^`) at the beginning). We already learned that **outside** of a character set, the caret matches a sequence of characters only when it is located at the beginning of a string. However, when the caret (`^`) appears at the **beginning** of a character set, it **negates** the set. This means it matches any character that is **not** in that character set. For example, the regular expression `[^6-9]` will match any character that is **not** a `6`, `7`, `8`, or `9`. Similarly, the regular expression `[^a-zA-Z]` will match any character that is **not** a lowercase or uppercase letter. Let’s see how this works.

In the code below, we will use the character set `[^6-9]` in our regular expression to find all the phone numbers that **do not** end in the digits `6`, `7`, `8`, or `9`:

In [62]:
# Import re module
import re

# Sample text
sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''

# Create a regular expression object with a regular expression that can match all the
# phone numbers that do not end in the digits 6, 7, 8, or 9.
regex = re.compile(r'\d{3}.\d{3}.\d{3}[^6-9]')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(89, 101), match='555)999-8464'>


As we can see, we only get one match since there is only one phone number that doesn't end with the numbers `6`, `7`, `8`, or `9`.
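
A negated set is not limited to digits: `[^6-9]` also matches letters, punctuation, and whitespace, whereas a digit range only matches digits. The sketch below compares `[^6-9]` with the range `[0-5]` on a short made-up string; which of the two is appropriate depends on whether non-digit characters should be accepted in that position.

```python
import re

# Hypothetical string with digits, a dash, and a letter
sample_text = '46-9x0'

# Negated set: every character that is not 6, 7, 8, or 9
negated_matches = re.compile(r'[^6-9]').findall(sample_text)

# Digit range: only the digits 0 through 5
low_digit_matches = re.compile(r'[0-5]').findall(sample_text)

print(negated_matches)    # ['4', '-', 'x', '0']
print(low_digit_matches)  # ['4', '0']
```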

## TODO: Find Phone Numbers With Country Codes

In the cell below, our `sample_text` consists of a multi-line string that mimics a phone book:

```
Mr. Brown: +1-555-123-4567
Mrs. Smith: +61 455 555 4549
Mr. Jackson: +375-655-777-7346
Ms. Wilson: +213(555)999-8464
```

Notice that each phone number has a country calling code. The country calling codes are preceded by the `+` sign and can have anywhere from 1 to 3 digits. Write a regular expression that can find all these phone numbers. This includes the `+` sign, the country calling code (regardless of the number of digits), and the phone number. As usual, save the regular expression object in a variable called `regex`. Then use the `.finditer()` method to search the `sample_text` string for the given regular expression. Finally, write a loop to print all the `matches` found by the `.finditer()` method.

**HINT:** You can use the quantifier `{m,n}` in your regular expression. This quantifier means there must be at least `m` repetitions, and at most `n` repetitions, of the previous regular expression. For example, `a/{1,3}b` will match `a/b`, `a//b`, and `a///b`. It won’t match `ab`, which has no slashes, or `a////b`, which has four slashes.

In [63]:
# Import re module
import re

# Sample text
sample_text = '''
Mr. Brown: +1-555-123-4567
Mrs. Smith: +61 455 555 4549
Mr. Jackson: +375-655-777-7346
Ms. Wilson: +213(555)999-8464
'''

# Create a regular expression object with a regular expression
regex = re.compile(r'\+\d{1,3}.\d{3}.\d{3}.\d{4}')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(12, 27), match='+1-555-123-4567'>
<re.Match object; span=(40, 56), match='+61 455 555 4549'>
<re.Match object; span=(70, 87), match='+375-655-777-7346'>
<re.Match object; span=(100, 117), match='+213(555)999-8464'>
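
The `a/{1,3}b` example from the hint can be verified with a few lines of code. The sketch below uses the `.fullmatch()` method, which only succeeds when the entire string matches the pattern, and records the outcome for each candidate string in a dictionary (a convenience chosen here for display).

```python
import re

# Between one and three slashes are allowed between 'a' and 'b'
regex = re.compile(r'a/{1,3}b')

candidates = ['ab', 'a/b', 'a//b', 'a///b', 'a////b']

# True when the whole candidate string matches the pattern
results = {text: regex.fullmatch(text) is not None for text in candidates}

for text, matched in results.items():
    print(text, matched)
```

Only the three middle candidates, with one to three slashes, are reported as matches.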
