# Character Sets

In this lesson, we will continue to look at metacharacters. In particular, we will learn how to look for phone numbers by employing the following metacharacters:

```python
{} []
```

### Finding Phone Numbers

In the code below, our `sample_text` consists of a multi-line string that mimics a phone book:

```
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
```

We can notice that even though all the phone numbers have different digits, they all have the same pattern, namely, 3 digits followed by a single character, followed by 3 more digits, followed by another single character, followed by 4 digits. We will take advantage of this pattern to create a regular expression that can match all these phone numbers. To do this, we will use the special sequence `\d` and the dot (`.`) in our regular expression, as shown in the code below:

In [1]:
# Import re module
import re

# Sample text
sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''

# Create a regular expression object with a regular expression that can match all the
# phone numbers in our sample_text
regex = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(12, 24), match='555-123-4567'>
<_sre.SRE_Match object; span=(37, 49), match='455 555 4549'>
<_sre.SRE_Match object; span=(63, 75), match='655-777-7346'>
<_sre.SRE_Match object; span=(89, 101), match='555)999-8464'>


We can see that we managed to find all the phone numbers in our multi-line string even though they all have different digits and different characters between the groups of numbers. Notice that by using the dot we were able to match the dash (`-`), the white space (` `), and the parenthesis (`)`) separating the groups of numbers. By using the dot we avoid having to create three different regular expressions to match the three possible characters separating the groups of numbers.

Now we can write the above regular expression in a more compact form by using the `{}` metacharacters. The sequence `{m}` specifies that exactly `m` copies of the previous regular expression should be matched. For example, the sequence `\d{3}` specifies that exactly `3` copies of the `\d` regular expression should be matched. Therefore, the sequence `\d{3}` is equivalent to the sequence `\d\d\d`.

Consequently, we can employ the `{}` metacharacters to write the previous code in a more compact form, as shown below:

In [2]:
# Import re module
import re

# Sample text
sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''

# Create a regular expression object with a regular expression that can match all the
# phone numbers in our sample_text using the {} metacharacters
regex = re.compile(r'\d{3}.\d{3}.\d{4}')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(12, 24), match='555-123-4567'>
<_sre.SRE_Match object; span=(37, 49), match='455 555 4549'>
<_sre.SRE_Match object; span=(63, 75), match='655-777-7346'>
<_sre.SRE_Match object; span=(89, 101), match='555)999-8464'>


As we can see, we get the same result as before.
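
One caveat worth keeping in mind: because the dot matches any character except a newline, the pattern above will also accept separators we probably did not intend. The short sketch below illustrates this on a made-up string (`noisy_text` is a hypothetical example, not part of the phone book used in this lesson), using `match.group()` to read the text of each match:

```python
import re

# Hypothetical text: only the first number uses a plausible separator
noisy_text = 'Good: 555-123-4567, odd: 555x123y4567'

regex = re.compile(r'\d{3}.\d{3}.\d{4}')

# Collect the matched text of every match
found = [match.group() for match in regex.finditer(noisy_text)]
print(found)  # ['555-123-4567', '555x123y4567']
```

Both strings are reported because the dot accepts `x` and `y` just as readily as a dash.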

### Finding Phone Numbers With Specific Separators

Now let's suppose we only wanted to find phone numbers in which the groups of digits were separated by either a dash (`-`) or a white space (` `). In this case we can use what is known as a **Character Set**. Character sets are specified using the `[]` metacharacters and are used to indicate a set of characters that you wish to match. Let’s see an example.

In the code below, we employ the character set `[- ]` (notice that there is a white space after the dash) in our regular expression to only match phone numbers whose groups of numbers are separated by either a dash (`-`) or a white space (` `):

In [3]:
# Import re module
import re

# Sample text
sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''

# Create a regular expression object with a regular expression that can match all the
# phone numbers that have either a dash or a white space as a separator
regex = re.compile(r'\d{3}[- ]\d{3}[- ]\d{4}')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(12, 24), match='555-123-4567'>
<_sre.SRE_Match object; span=(37, 49), match='455 555 4549'>
<_sre.SRE_Match object; span=(63, 75), match='655-777-7346'>


We can clearly see that we now only match the phone numbers that have either a dash (`-`) or a white space (` `) as a separator. Notice that the last phone number is not matched because, even though its last group of numbers is preceded by a dash (`-`), its first group of numbers is followed by a parenthesis (`)`), which is not in our character set.

It is important to note that even though a character set can have many characters, it only matches one of those characters at a time. For example, suppose I added a white space after the dash in Mr. Brown's phone number, as shown below:

In [4]:
# Import re module
import re

# Sample text
sample_text = '''
Mr. Brown: 555- 123- 4567
'''

# Create a regular expression object with a regular expression that can match all the
# phone numbers that have either a dash or a white space as a separator
regex = re.compile(r'\d{3}[- ]\d{3}[- ]\d{4}')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

We can see that we now get no matches. This is because the character set `[- ]`, used in our regular expression, matches only one of its characters at a time. In other words, in order to get a match there must be either a dash **or** a white space separating the groups of numbers, but **not** both.
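
To see this one-character behaviour in isolation, the sketch below checks the set `[- ]` against three short strings with `re.fullmatch`, which only succeeds when the whole string matches the pattern. The helper name `is_single_separator` and the `relaxed` pattern are made up for this illustration; the `{1,2}` repetition is just one possible way to accept a dash followed by a space and is not used elsewhere in this lesson.

```python
import re

def is_single_separator(text):
    """Return True if text is exactly one character from the set [- ]."""
    return re.fullmatch(r'[- ]', text) is not None

print(is_single_separator('-'))   # True: a single dash
print(is_single_separator(' '))   # True: a single white space
print(is_single_separator('- '))  # False: two characters

# Assumption: allow one or two separator characters between the groups
relaxed = re.compile(r'\d{3}[- ]{1,2}\d{3}[- ]{1,2}\d{4}')
relaxed_match = relaxed.search('Mr. Brown: 555- 123- 4567').group()
print(relaxed_match)  # 555- 123- 4567
```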

### Finding Phone Numbers With Specific Separators and Area Codes

Let's see another example of a character set. Now, let's suppose we only wanted to find phone numbers in which the groups of digits were separated by either a dash or a white space, and that have area code `455` or `655`. Since all the area codes in our `sample_text` end in 55:

```
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
```

Then, in order to find all the phone numbers that have area code `455` or `655`, we only need to indicate that the first digit in the area code must be either a `4` or a `6`. 

To do this, we can use the character set `[46]` in our regular expression to indicate that the first number should be either a `4` or a `6`, as shown in the code below:

In [5]:
# Import re module
import re

# Sample text
sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''

# Create a regular expression object with a regular expression that can match all the
# phone numbers that have either a dash or a white space as a separator and have area
# code 455 or 655
regex = re.compile(r'[46]55[- ]\d{3}[- ]\d{4}')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(37, 49), match='455 555 4549'>
<_sre.SRE_Match object; span=(63, 75), match='655-777-7346'>


We can see that we only get the two phone numbers that have area code `455` and `655`; and that have either a dash or a white space as a separator.

### Finding Phone Numbers With Specific Last Digits

Now let's suppose we wanted to look for phone numbers that end on the numbers `6`, `7`, `8`, or `9`. In this case, we could use the character set `[6789]`. However, there is a more compact way of doing this. **Within** a character set, when a dash (`-`) is placed **between** digits or letters, it specifies a range. For example, the character set `[6-9]` is equivalent to the character set `[6789]` and the character set `[a-f]` is equivalent to the character set `[abcdef]`. It is important to note that when a dash is placed at the **beginning** of a character set, as we did in the previous examples, the dash is taken **literally**. Let’s see how this works.

In the code below, we will use the character set `[6-9]` in our regular expression to find all the phone numbers that end on the numbers `6`, `7`, `8`, or `9`:

In [6]:
# Import re module
import re

# Sample text
sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''

# Create a regular expression object with a regular expression that can match all the
# phone numbers that end on the numbers 6, 7, 8, or 9.
regex = re.compile(r'\d{3}.\d{3}.\d{3}[6-9]')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(12, 24), match='555-123-4567'>
<_sre.SRE_Match object; span=(37, 49), match='455 555 4549'>
<_sre.SRE_Match object; span=(63, 75), match='655-777-7346'>


As we can see, we get all the phone numbers that end on the numbers `6`, `7`, `8`, or `9`. Notice, that the last phone number is not matched because its last digit is a `4`.
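
The two roles of the dash inside a character set can be checked on their own. The sketch below uses `re.findall`, which returns a list of all matching substrings, on two short strings chosen for this illustration:

```python
import re

digits = '0123456789'

# A range and the explicit list select the same characters
range_hits = re.findall(r'[6-9]', digits)
list_hits = re.findall(r'[6789]', digits)
print(range_hits)  # ['6', '7', '8', '9']
print(list_hits)   # ['6', '7', '8', '9']

# A dash at the beginning of the set is a literal dash, not a range
literal_hits = re.findall(r'[-6]', '5-6-7')
print(literal_hits)  # ['-', '6', '-']
```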

Now let's suppose we wanted to find the phone numbers that **do not** end on the numbers `6`, `7`, `8`, or `9`. In this case we could use the character set `[0-5]`. However, we could also use the regular expression `[^6-9]` (notice the caret (`^`) at the beginning). We already learned that **outside** of a character set, the caret matches a sequence of characters when they are located at the beginning of a string. However, when the caret (`^`) appears at the **beginning** of a character set it **negates** the set. This means it matches everything that is **not** in that character set. For example, the regular expression `[^6-9]` will match any character that is **not** a `6`, `7`, `8`, or `9`. Similarly, the regular expression `[^a-zA-Z]` will match any character that is **not** a lowercase or uppercase letter. Let’s see how this works.

In the code below, we will use the character set `[^6-9]` in our regular expression to find all the phone numbers that **do not** end on the numbers `6`, `7`, `8`, or `9`:

In [7]:
# Import re module
import re

# Sample text
sample_text = '''
Mr. Brown: 555-123-4567
Mrs. Smith: 455 555 4549
Mr. Jackson: 655-777-7346
Ms. Wilson: (555)999-8464
'''

# Create a regular expression object with a regular expression that can match all the
# phone numbers that do not end on the numbers 6, 7, 8, or 9.
regex = re.compile(r'\d{3}.\d{3}.\d{3}[^6-9]')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(89, 101), match='555)999-8464'>


As we can see, we only get one match since there is only one phone number that doesn't end with the numbers `6`, `7`, `8`, or `9`.
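
One subtlety: a negated set such as `[^6-9]` accepts any character outside the set, letters and punctuation included, not only the remaining digits. The sketch below compares it with the digit-only set `[0-5]` on two made-up strings (`candidates` is hypothetical input, not taken from the phone book):

```python
import re

negated = re.compile(r'\d{3}.\d{3}.\d{3}[^6-9]')
digits_only = re.compile(r'\d{3}.\d{3}.\d{3}[0-5]')

candidates = ['555-999-8464', '555-999-846x']  # hypothetical inputs

for text in candidates:
    # Print whether each pattern matches the whole string
    print(text,
          negated.fullmatch(text) is not None,
          digits_only.fullmatch(text) is not None)
```

The first string is accepted by both patterns, while the second, which ends in `x`, is accepted only by the negated set.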

# TODO: Find Phone Numbers With Country Codes

In the cell below, our `sample_text` consists of a multi-line string that mimics a phone book:

```
Mr. Brown: +1-555-123-4567
Mrs. Smith: +61 455 555 4549
Mr. Jackson: +375-655-777-7346
Ms. Wilson: +213(555)999-8464
```

Notice that each phone number has a country calling code. The country calling codes are preceded by the `+` sign and can have anywhere from 1 to 3 digits. Write a regular expression that can find all these phone numbers. This includes the `+` sign, the country calling code (regardless of the number of digits), and the phone number. As usual, save the regular expression object in a variable called `regex`. Then use the `.finditer()` method to search the `sample_text` string for the given regular expression. Finally, write a loop to print all the `matches` found by the `.finditer()` method.

**HINT:** You can use the qualifier `{m,n}` in your regular expression. This qualifier means there must be at least `m` repetitions, and at most `n` repetitions of the previous regular expression. For example, `a/{1,3}b` will match `a/b`, `a//b`, and `a///b`. It won’t match `ab`, which has no slashes, or `a////b`, which has four slashes.
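
The `a/{1,3}b` example from the hint can be verified directly. This small sketch uses the pattern object's `.fullmatch()` method, which succeeds only when the entire string matches; the `results` dictionary is just a convenience for this illustration:

```python
import re

regex = re.compile(r'a/{1,3}b')

# Map each candidate string to whether the whole string matches
results = {text: regex.fullmatch(text) is not None
           for text in ['ab', 'a/b', 'a//b', 'a///b', 'a////b']}

for text, matched in results.items():
    print(text, matched)
```

Only the strings with one, two, or three slashes are reported as matches.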

In [None]:
# Import re module
import re

# Sample text
sample_text = '''
Mr. Brown: +1-555-123-4567
Mrs. Smith: +61 455 555 4549
Mr. Jackson: +375-655-777-7346
Ms. Wilson: +213(555)999-8464
'''

# Create a regular expression object with a regular expression
regex = re.compile(r'\+\d{1,3}.\d{3}.\d{3}.\d{4}')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

# Finding Complicated Patterns

In this lesson, we will learn how to use the remaining metacharacters in our list, namely:

```python
* + ? | ( )
```
We will employ these metacharacters to find more complicated patterns of text. 

### Finding Names

In the code below, our `sample_text` consists of a multi-line string that contains the names and heights of the 4 highest mountains in the world according to Wikipedia:

```
Mt Everest: Height 8,848 m
Mt. K2: Height 8,611 m
Mt Kangchenjunga: Height 8,586 m
Mt. Lhotse: Height 8,516 m
```

Let's create a regular expression that will allow us to find the names of these mountains. The first thing to notice is that the word mountain has been abbreviated in two different ways, as `Mt.` and as `Mt` (without the period). Therefore, if we want to find all the names of the mountains we need to indicate in our regular expression that the period (`.`) in the abbreviation is optional. We can do this by using the `?` metacharacter in our regular expression. The `?` will match 0 or 1 repetitions of the preceding regular expression. For example, the regular expression `ab?` will match either `a` or `ab`. In other words, the `?` after the `b` indicates that the `b` after the `a` is optional. Let’s see how this works.

In the code below, we employ the `?` metacharacter to indicate that the period (`.`) after `Mt` is optional by using the regular expression `Mt\.?`:

In [1]:
# Import re module
import re

# Sample text
sample_text = '''
Mt Everest: Height 8,848 m
Mt. K2: Height 8,611 m
Mt Kangchenjunga: Height 8,586 m
Mt. Lhotse: Height 8,516 m
'''

# Create a regular expression object with a regular expression 'Mt\.?'
regex = re.compile(r'Mt\.?')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(1, 3), match='Mt'>
<re.Match object; span=(28, 31), match='Mt.'>
<re.Match object; span=(51, 53), match='Mt'>
<re.Match object; span=(84, 87), match='Mt.'>


We can clearly see that the regular expression `Mt\.?` was able to match either `Mt` or `Mt.`
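
The backslash in `Mt\.?` matters. Without it, the dot is a metacharacter and `.?` means "an optional character of any kind". The sketch below compares the two patterns on a single name using `re.search`, which returns the first match in a string:

```python
import re

text = 'Mt Everest'

escaped = re.search(r'Mt\.?', text).group()   # optional literal period
unescaped = re.search(r'Mt.?', text).group()  # optional "any character"

print(repr(escaped))    # 'Mt'
print(repr(unescaped))  # 'Mt '
```

The unescaped version swallows the white space after `Mt`, which would get in the way once we try to match that white space explicitly.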

Now let's continue creating our regular expression so that it can match all the mountain names. 
We continue by matching the next character after the abbreviation. We notice that after each abbreviation there is a white space, so we will use the special sequence `\s` to match it.

After that white space, we have the name of the mountain. We can see that the first letter in all the names is an uppercase letter, so we will use the character set `[A-Z]` to match any possible uppercase letter.

Now comes the tricky part. We can see that the mountain names have different lengths. For example, the third mountain has a long name,  `Kangchenjunga`, but the second mountain has a very short name, `K2`. We can get around this problem by noticing that all the names are composed of only alphanumeric characters.

To match any alphanumeric character we will use the special sequence `\w`, and to help us match names of any length we will use the `*` metacharacter. The `*` metacharacter matches 0 or more repetitions of the preceding regular expression. In other words, it matches as many repetitions as possible of the preceding regular expression, including none at all. For example, the regular expression `ab*` will match `a` or `a` followed by any number of `b`'s, such as `ab` or `abbbbb`. Let's see how this works.
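
As a quick side-by-side of the two metacharacters seen so far in this section, the sketch below checks `ab?` and `ab*` against the same three strings. The variable names are made up for this illustration, and `.fullmatch()` is used so that the whole string has to match:

```python
import re

optional = re.compile(r'ab?')  # 'a' followed by zero or one 'b'
repeated = re.compile(r'ab*')  # 'a' followed by zero or more 'b's

for text in ['a', 'ab', 'abbbbb']:
    print(text,
          optional.fullmatch(text) is not None,
          repeated.fullmatch(text) is not None)
```

Both patterns accept `a` and `ab`, but only `ab*` accepts `abbbbb`.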

In the code below, we employ the `*` metacharacter to find the names of the mountains regardless of their length:

In [3]:
# Import re module
import re

# Sample text
sample_text = '''
Mt Everest: Height 8,848 m
Mt. K2: Height 8,611 m
Mt Kangchenjunga: Height 8,586 m
Mt. Lhotse: Height 8,516 m
'''

# Create a regular expression object with a regular expression that can match all the
# mountain names
regex = re.compile(r'Mt\.?\s[A-Z]\w*')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(1, 11), match='Mt Everest'>
<re.Match object; span=(28, 34), match='Mt. K2'>
<re.Match object; span=(51, 67), match='Mt Kangchenjunga'>
<re.Match object; span=(84, 94), match='Mt. Lhotse'>


We can see that we managed to match all the mountain names regardless of their length or abbreviation.

# Groups

In the code below, we have added a new mountain to our `sample_text` string:

```
Mnt makalu: Height 8,485 m
```

As we can see, the name of this mountain has two differences from the other ones. The first difference is that the word mountain has been abbreviated as `Mnt` instead of `Mt` or `Mt.`. The second difference is that the first letter of the name is lowercase not uppercase. 

To be able to match `Mnt` as well as `Mt` or `Mt.`, we will use the `( )` metacharacters to define a **Group**. As their name suggests, **groups** group together the expressions contained inside them. For example, we saw before that `ab*` will match `a` or `a` followed by any number of `b`'s, such as `ab` or `abbbbb`. But if you put `ab` inside parentheses to define the **group** `(ab)`, then `(ab)*` will match zero or more repetitions of `ab`, for example `ab` or `abababab`. You can repeat the contents of a group with any repeating qualifier, such as the `*`, `?`, or `{m}` that we have seen before. We can also use the OR `|` metacharacter within the group to be able to select between two expressions. Let’s see how this works.
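
The difference between `ab*` and `(ab)*` is easy to check on two short strings. In this minimal sketch (variable names are illustrative), `.fullmatch()` requires the whole string to match:

```python
import re

without_group = re.compile(r'ab*')  # * applies to 'b' only
with_group = re.compile(r'(ab)*')   # * applies to the whole group 'ab'

for text in ['abbb', 'abab']:
    print(text,
          without_group.fullmatch(text) is not None,
          with_group.fullmatch(text) is not None)
```

`abbb` is accepted only by `ab*`, while `abab` is accepted only by `(ab)*`.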

In the code below, we will use the group `(Mt|Mnt)` in our regular expression to be able to match either `Mnt` or `Mt`:

In [4]:
# Import re module
import re

# Sample text
sample_text = '''
Mt Everest: Height 8,848 m
Mt. K2: Height 8,611 m
Mt Kangchenjunga: Height 8,586 m
Mt. Lhotse: Height 8,516 m
Mnt makalu: Height 8,485 m
'''

# Create a regular expression object with a regular expression that can match all the
# mountain names
regex = re.compile(r'(Mt|Mnt)\.?\s[a-zA-Z]\w*')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(1, 11), match='Mt Everest'>
<re.Match object; span=(28, 34), match='Mt. K2'>
<re.Match object; span=(51, 67), match='Mt Kangchenjunga'>
<re.Match object; span=(84, 94), match='Mt. Lhotse'>
<re.Match object; span=(111, 121), match='Mnt makalu'>


As we can see, we were able to match all the mountain names, including the new one. Also, notice that we added lowercase letters, `[a-zA-Z]`, to our previous character set in our regular expression. This was done in order to be able to match the first lowercase letter of the new name. 

We should point out that since the first letter in both abbreviations is an `M`, we could have put the `M` outside of the group and gotten the same result, as shown below:

In [6]:
# Import re module
import re

# Sample text
sample_text = '''
Mt Everest: Height 8,848 m
Mt. K2: Height 8,611 m
Mt Kangchenjunga: Height 8,586 m
Mt. Lhotse: Height 8,516 m
Mnt makalu: Height 8,485 m
'''

# Create a regular expression object with a regular expression that can match all the
# mountain names
regex = re.compile(r'M(t|nt)\.?\s[a-zA-Z]\w*')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(1, 11), match='Mt Everest'>
<re.Match object; span=(28, 34), match='Mt. K2'>
<re.Match object; span=(51, 67), match='Mt Kangchenjunga'>
<re.Match object; span=(84, 94), match='Mt. Lhotse'>
<re.Match object; span=(111, 121), match='Mnt makalu'>


Notice that we get the same result as before.
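
Since the two abbreviations differ only by an `n` that may or may not be present, a group is not the only option here. As a sketch of an alternative (not the approach taken in this lesson), the `?` metacharacter can mark the `n` as optional, and the comparison below suggests it finds the same names:

```python
import re

sample_text = '''
Mt Everest: Height 8,848 m
Mt. K2: Height 8,611 m
Mt Kangchenjunga: Height 8,586 m
Mt. Lhotse: Height 8,516 m
Mnt makalu: Height 8,485 m
'''

with_group = re.compile(r'M(t|nt)\.?\s[a-zA-Z]\w*')
with_optional = re.compile(r'Mn?t\.?\s[a-zA-Z]\w*')  # alternative: optional 'n'

# Collect the matched text produced by each pattern
group_names = [m.group() for m in with_group.finditer(sample_text)]
optional_names = [m.group() for m in with_optional.finditer(sample_text)]

print(group_names == optional_names)  # True
```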

# TODO: Finding email Addresses Revisited

In the cell below, our `sample_text` consists of a multi-line string with four different email addresses. Write a regular expression that is able to find all these email addresses. As usual, save the regular expression object in a variable called `regex`. Then use the `.finditer()` method to search the `sample_text` string for the given regular expression. Finally, write a loop to print all the `matches` found by the `.finditer()` method.

**HINTS:** Notice that all the characters before the `@` symbol only contain lowercase letters, underscores, and numbers. To match this part of the email address we can use the character set `[a-z_0-9]` followed by the `+` metacharacter, to account for the fact that all email addresses must have at least one character before the `@` symbol. The `+` metacharacter matches 1 or more repetitions of the preceding regular expression. For example, `ab+` will match `a` followed by any non-zero number of `b`’s, such as `ab` or `abb`, but it will not match just `a`.

The `@` symbol is not a metacharacter so we can match it directly without the need of escaping it. Also, notice that the domain names contain lowercase letters, uppercase letters, underscores, and dashes. Again we can use the character set `[a-zA-Z_-]` followed by the `+` metacharacter, to account for the fact that all domains must have at least one character after the `@` symbol. To match any dot (`.`), we need to escape it with a backslash (`\.`) because the dot is a metacharacter. You can use the character set `[a-z]+` to match either `com`, `edu`, or `gov`.

To match the last email address you need to add an optional dot followed by another character set of only lowercase letters.
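
Before writing the full expression, it can help to try the `+` metacharacter on its own. The sketch below first checks the `ab+` example from the hints, then applies `[a-z_0-9]+` to one of the addresses with `re.match`, which only looks for a match at the start of the string. The variable names are made up for this illustration:

```python
import re

# '+' requires at least one repetition, so a bare 'a' is rejected
one_or_more = re.compile(r'ab+')
plus_results = {text: one_or_more.fullmatch(text) is not None
                for text in ['a', 'ab', 'abb']}
print(plus_results)  # {'a': False, 'ab': True, 'abb': True}

# The same idea applied to the part of an address before the '@'
local_part = re.match(r'[a-z_0-9]+', 'fake891_email@fakemail.gov').group()
print(local_part)  # fake891_email
```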

In [7]:
# Import re module
import re

# Sample text
sample_text = '''
fake_email@fake-email.edu
fakeemail43@fake_email.com
fake891_email@fakemail.gov
52fake_email@FAKE_email.com.nl
'''

# Create a regular expression object with a regular expression that can match all
# the email addresses
regex = re.compile(r'[a-z_0-9]+@[a-zA-Z_-]+\.[a-z]+\.?[a-z]+')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(1, 26), match='fake_email@fake-email.edu'>
<re.Match object; span=(27, 53), match='fakeemail43@fake_email.com'>
<re.Match object; span=(54, 80), match='fake891_email@fakemail.gov'>
<re.Match object; span=(81, 111), match='52fake_email@FAKE_email.com.nl'>


# Substitutions

As we mentioned at the beginning of this lesson, the `re` module also has functions that allow us to modify strings. Regex objects have the `.sub()` method that allows us to replace patterns within a string. Let's see an example.

In the code below we have a multi-line string that contains two instances of the ampersand character, `&`. Let's use the `.sub()` method to replace these ampersands with the word `and`. First we will create a regular expression that matches all the `&` characters in our string. Then we will use `regex.sub(r'and', sample_text)` to replace every match of the `regex` expression in the `sample_text` with the raw string `and`. Let's see this in action:

In [8]:
# Import re module
import re

# Sample text
sample_text = '''
Ben & Jerry
Jack & Jill
'''

# Create a regular expression object with the regular expression '&'
regex = re.compile(r'&')

# Substitute all & in the sample_text with 'and'
new_text = regex.sub(r'and', sample_text)

# Print Original and Modified texts
print('Original text:', sample_text)
print('Modified text:', new_text)

Original text: 
Ben & Jerry
Jack & Jill

Modified text: 
Ben and Jerry
Jack and Jill



We can see that we have successfully replaced all the `&` characters with the word `and`. Being able to make these kinds of substitutions can be really useful and save you a lot of time if you are working with large documents that you need to reformat.
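
By default `.sub()` replaces every match. It also accepts an optional `count` argument that limits how many replacements are made. The sketch below shows both behaviours on a one-line variant of the text above (the single-line `sample_text` is an assumption made to keep the output short):

```python
import re

sample_text = 'Ben & Jerry, Jack & Jill'
regex = re.compile(r'&')

all_replaced = regex.sub('and', sample_text)         # replace every '&'
first_only = regex.sub('and', sample_text, count=1)  # replace only the first '&'

print(all_replaced)  # Ben and Jerry, Jack and Jill
print(first_only)    # Ben and Jerry, Jack & Jill
```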

# Substitutions with Groups

We can do more sophisticated substitutions by using groups. Let's see an example. In the code below we have a multi-line string that contains the names of 4 people. As we can see, some people have middle names but others don't. Let's use the `.sub()` method to replace all names in the string with just the first and last name. For example, the name `John David Smith` should be replaced by `John Smith` and `Alice Jackson` should stay the same.

The first step is to create a regular expression that matches all the names in the list. Keeping in mind that we need to be able to make replacements later, we will use groups to distinguish between the first name, the middle name, and the last name. Since all names have a first name, we can use the group `([a-zA-Z]+)` to match all the first names. Not all names have middle names, so the middle name is optional. Since the first and middle names are separated by a whitespace, we also need to indicate that this whitespace is optional. To indicate that both the whitespace and the middle name are optional, we will include the `?` metacharacter after the whitespace and after the second group: `[ ]?([a-zA-Z]+)?`. After the first or middle name we have a whitespace that we can match with `[ ]`. Notice that in this case we didn't use the sequence `\s`, since it would match newlines as well and we don't want to match those. Finally, we make a third group to match the last name. Since all names have last names, we don't need to use the `?` metacharacter. Putting it all together we get:

In [9]:
# Import re module
import re

# Sample text
sample_text = '''
John David Smith
Alice Jackson
Mary Elizabeth Wilson
Mike Brown
'''

# Create a regular expression object with a regular expression that can find all
# the names in the sample_text and group the first, middle, and
# last names separately
regex = re.compile(r'([a-zA-Z]+)[ ]?([a-zA-Z]+)?[ ]([a-zA-Z]+)')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(1, 17), match='John David Smith'>
<re.Match object; span=(18, 31), match='Alice Jackson'>
<re.Match object; span=(32, 53), match='Mary Elizabeth Wilson'>
<re.Match object; span=(54, 64), match='Mike Brown'>


We can clearly see that we matched all the four names in our list. Now, the cool thing about using groups is that we can reference them individually from the Match Objects using the `.group()` method. The `.group(N)` method selects the `N`th group in the match. Therefore, in our particular case, for each match, `.group(1)` will select the first name, `.group(2)` will select the middle name, and `.group(3)` will select the last name. Let's see how this works in the code below:

In [10]:
# Import re module
import re

# Sample text
sample_text = '''
John David Smith
Alice Jackson
Mary Elizabeth Wilson
Mike Brown
'''

# Create a regular expression object with a regular expression that can find all
# the names in the sample_text and group the first, middle, and
# last names separately
regex = re.compile(r'([a-zA-Z]+)[ ]?([a-zA-Z]+)?[ ]([a-zA-Z]+)')

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# For each match print the first, middle, and last name separately
for match in matches:
    print('\nFirst Name: '+ match.group(1))
    
    if match.group(2) is None:
        print('Middle Name: None')
    else:
        print('Middle Name: '+ match.group(2))
    print('Last Name: '+ match.group(3))


First Name: John
Middle Name: David
Last Name: Smith

First Name: Alice
Middle Name: None
Last Name: Jackson

First Name: Mary
Middle Name: Elizabeth
Last Name: Wilson

First Name: Mike
Middle Name: None
Last Name: Brown


We can see that for each of the four matches we can selectively choose the first, middle, or last name. We should also mention that `.group(0)` (or equivalently `.group()`) returns the entire match, that is, the full text matched by the regular expression.
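
As a small illustration of `.group(0)`, the sketch below runs the names pattern (written with the range `A-Z`) on a single name. It also uses the match object's `.groups()` method, which is not covered in this lesson and returns all the captured groups as a tuple:

```python
import re

regex = re.compile(r'([a-zA-Z]+)[ ]?([a-zA-Z]+)?[ ]([a-zA-Z]+)')
match = regex.search('John David Smith')

whole = match.group(0)   # the entire matched text
same = match.group()     # equivalent to .group(0)
parts = match.groups()   # the three captured groups as a tuple

print(whole)  # John David Smith
print(same)   # John David Smith
print(parts)  # ('John', 'David', 'Smith')
```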

Now that we know how to select groups individually for each match, we are ready to use the `.sub()` method to make substitutions. Remember, `regex.sub(r'string', sample_text)` will replace every match of the `regex` expression in the `sample_text` with the raw string `string`. So, what we want to do in our case is to replace every match with only the first and last names, or equivalently, replace every match with the first and third groups. We can refer to each group in the `string` by using a backslash followed by the group number. For example, `regex.sub(r'\1', sample_text)` will replace every match with the first group. Here we have referenced the first group by using `\1` inside the `string`. Let's put it all together to see how it works:

In [11]:
# Import re module
import re

# Sample text
sample_text = '''
John David Smith
Alice Jackson
Mary Elizabeth Wilson
Mike Brown
'''

# Create a regular expression object with a regular expression that can find all
# the names in the sample_text and group the first, middle, and
# last names separately
regex = re.compile(r'([a-zA-Z]+)[ ]?([a-zA-Z]+)?[ ]([a-zA-Z]+)')

# Substitute all names in the sample_text with the first and last name
new_text = regex.sub(r'\1 \3', sample_text)

# Print the modified text
print(new_text)


John Smith
Alice Jackson
Mary Wilson
Mike Brown
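
Group references can also be used to reorder the pieces of a match, not only to drop some of them. As a hedged variation on the cell above (the "last name, first name" format is an assumption chosen for illustration, and the pattern is written with the range `A-Z`), the sketch below swaps the first and third groups:

```python
import re

sample_text = '''
John David Smith
Alice Jackson
Mary Elizabeth Wilson
Mike Brown
'''

regex = re.compile(r'([a-zA-Z]+)[ ]?([a-zA-Z]+)?[ ]([a-zA-Z]+)')

# Put the last name (group 3) first, followed by the first name (group 1)
reordered = regex.sub(r'\3, \1', sample_text)
print(reordered)
```

Each line of the result has the form `Smith, John`, with the middle name dropped as before.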



# Flags

We saw at the beginning of this lesson that regexes are case sensitive, so we often have to write regexes that account for both uppercase and lowercase letters. However, the `re.compile(pattern, flags)` function has a `flags` parameter that allows more flexibility. For example, the `re.IGNORECASE` flag can be used to perform **case-insensitive** matching. In the code below we have a string that contains the name Walter written in two different combinations of uppercase and lowercase letters. To find these two renditions of Walter without a flag, we would probably have to use a long sequence of character sets to account for all possible combinations of lowercase and uppercase letters. However, in this case we can use the `re.IGNORECASE` flag to indicate that we don't care about the case of the letters; we just want to find the name Walter no matter how it is written. Let's see how this works:

In [12]:
# Import re module
import re

# Sample text
sample_text = 'Alice and WaLtEr Brown are talking with wAlTer Jackson.'

# Create a regular expression object with the regular expression 'walter'
# that ignores the case of the letters
regex = re.compile(r'walter', re.IGNORECASE)

# Search the sample_text for the regular expression
matches = regex.finditer(sample_text)

# Print all the matches
for match in matches:
    print(match)

<re.Match object; span=(10, 16), match='WaLtEr'>
<re.Match object; span=(40, 46), match='wAlTer'>


We can clearly see that we were able to match both renditions of `walter` without any fancy regular expression. 
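
For comparison, here is roughly what the flag saves us from writing. The sketch below spells out one character set per letter (the `with_sets` pattern is an illustration of the long form, not something you would normally write) and checks that it finds the same text as the `re.IGNORECASE` version:

```python
import re

sample_text = 'Alice and WaLtEr Brown are talking with wAlTer Jackson.'

with_flag = re.compile(r'walter', re.IGNORECASE)
with_sets = re.compile(r'[Ww][Aa][Ll][Tt][Ee][Rr]')  # one set per letter

flag_matches = [m.group() for m in with_flag.finditer(sample_text)]
set_matches = [m.group() for m in with_sets.finditer(sample_text)]

print(flag_matches)                 # ['WaLtEr', 'wAlTer']
print(flag_matches == set_matches)  # True
```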

We have seen a lot in this lesson and we have just begun to scratch the surface of regular expressions. For more information on regexes make sure to check out the Python [Regex Documentation](https://docs.python.org/2/library/re.html#module-re).