# 07 REGEX: PYTHON RE QUESTION MARK

In [1]:
import re

## What's the Python Re ? Quantifier?

When applied to a regular expression A, Python's ? quantifier (written A?) matches either zero or one occurrence of A. The ? quantifier applies only to the regular expression immediately preceding it.

For example, the regular expression 'hey?' matches both strings 'he' and 'hey'. But it does not match the empty string because the ? quantifier does not apply to the whole regex 'hey' but only to the preceding regex 'y'.
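As a quick check of this claim, here is a minimal sketch that tests a few candidate strings against 'hey?'. It uses re.fullmatch() so that only whole-string matches count. The candidate list (including 'heyy') and the variable names are chosen for illustration and are not part of the original examples.

```python
import re

# re.fullmatch() succeeds only if the entire string matches the pattern.
candidates = ['he', 'hey', '', 'heyy']
results = {s: re.fullmatch('hey?', s) is not None for s in candidates}

for s, ok in results.items():
    print(repr(s), ok)
# 'he' True
# 'hey' True
# '' False
# 'heyy' False
```

The empty string fails because 'h' and 'e' are still required, and 'heyy' fails because ? allows at most one 'y'.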

In [2]:
re.findall('aa[cde]?', 'aacde aa aadcde')

['aac', 'aa', 'aad']

In [3]:
re.findall('aa?', 'accccacccac')

['a', 'a', 'a']

In [4]:
re.findall('[cd]?[cde]?', 'ccc dd ee')

['cc', 'c', '', 'dd', '', 'e', 'e', '']

Don't worry if you had problems understanding those examples. You'll learn about them next. Here's the first example:

In [5]:
re.findall('aa[cde]?', 'aacde aa aadcde')

['aac', 'aa', 'aad']

You use the re.findall() function. In case you don't know it, here's the definition:

The re.findall(pattern, string) function scans the string from left to right, finds all non-overlapping occurrences of the pattern, and returns them as a list of matching substrings (for a pattern without capturing groups, as in these examples).

The first argument is the regular expression pattern 'aa[cde]?'. The second argument is the string to be searched for the pattern. In plain English, you want to find all substrings that start with two 'a' characters, followed by one optional character that can be either 'c', 'd', or 'e'.

The findall() function returns three matching substrings:

* First, string 'aac' matches the pattern. After Python consumes the matched substring, the remaining substring is 'de aa aadcde'.
* Second, string 'aa' matches the pattern. Python consumes it which leads to the remaining substring ' aadcde'.
* Third, string 'aad' matches the pattern in the remaining substring. What remains is 'cde' which doesn't contain a matching substring anymore.


In [6]:
re.findall('aa?', 'accccacccac')

['a', 'a', 'a']

In this example, you're looking at the simple pattern 'aa?'. You want to find all occurrences of character 'a' followed by an optional second 'a'. But be aware that the optional second 'a' is not needed for the pattern to match.

The string 'accccacccac' never contains two consecutive 'a' characters, so the regex engine finds three matches, each a single 'a'.
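To see the optional second 'a' actually being used, you can try the same pattern on a string that does contain doubled characters. The input 'aa a aaa' below is an assumed example, not one from the original notebook.

```python
import re

# Where two 'a' characters are adjacent, the optional second 'a' is consumed too.
matches = re.findall('aa?', 'aa a aaa')
print(matches)
# ['aa', 'a', 'aa', 'a']
```

In 'aaa', the first two characters form one match and the leftover 'a' forms another, because matches do not overlap.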

In [7]:
re.findall('[cd]?[cde]?', 'ccc dd ee')

['cc', 'c', '', 'dd', '', 'e', 'e', '']

This regex pattern looks complicated: '[cd]?[cde]?'. But is it really?

Let's break it down step-by-step:

The first part of the regex [cd]? defines a character class [cd] which reads as "match either c or d". The question mark quantifier indicates that you want to match either one or zero occurrences of this pattern.

The second part of the regex [cde]? defines a character class [cde] which reads as "match either c, d, or e". Again, the question mark indicates the zero-or-one matching requirement.

As both parts are optional, the empty string matches the regex pattern. However, the Python regex engine is greedy: it attempts to match as much as possible.
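Both statements can be checked directly. The sketch below uses re.fullmatch() for the empty string and re.match() to look at what the pattern consumes at the start of a string. The test inputs 'ccc' and 'e' are assumptions chosen for illustration.

```python
import re

pattern = '[cd]?[cde]?'

# The empty string is a valid match because both parts may match zero characters.
empty_ok = re.fullmatch(pattern, '') is not None

# At the start of 'ccc', each optional part greedily takes one character.
greedy = re.match(pattern, 'ccc').group()

# 'e' is not in [cd], so the first part matches nothing and the second takes 'e'.
only_second = re.match(pattern, 'e').group()

print(empty_ok, repr(greedy), repr(only_second))
# True 'cc' 'e'
```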

Thus, the regex engine performs the following steps:

1. The first match in the string 'ccc dd ee' is 'cc'. The regex engine consumes the matched substring, so the string 'c dd ee' remains.
2. The second match in the remaining string is the character 'c'. The space ' ' that follows is not in the character class [cde], so the second part of the regex, [cde]?, matches zero characters. Because of the question mark quantifier, this is okay for the regex engine. The remaining string is ' dd ee'.
3. The third match is the empty string ''. At the space, neither character class matches, so both optional parts match zero characters. Python does not attempt to match the same position twice, so it skips the space and moves on to process the remaining string 'dd ee'.
4. The fourth match is the string 'dd'. The remaining string is ' ee'.
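5. The fifth match is the empty string '', found at the second space. The remaining string is 'ee'.
6. The sixth match is the string 'e'. The first part [cd]? cannot match 'e', so it matches zero characters and only the second part [cde]? consumes a character. The remaining string is 'e'.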
7. The seventh match is the string 'e'. The remaining string is ''.
8. The eighth match is the string ''. Nothing remains.
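The eight steps above can be reproduced with re.finditer(), which reports the start and end index of every match, including the empty ones. This is a sketch assuming Python 3.7 or later. The variable names are illustrative.

```python
import re

text = 'ccc dd ee'

# Collect each match together with its (start, end) position in the string.
spans = [(m.group(), m.start(), m.end())
         for m in re.finditer('[cd]?[cde]?', text)]

for matched, start, end in spans:
    print(repr(matched), start, end)
# 'cc' 0 2
# 'c' 2 3
# '' 3 3
# 'dd' 4 6
# '' 6 6
# 'e' 7 8
# 'e' 8 9
# '' 9 9
```

The empty matches have identical start and end indices: two sit at the spaces (indices 3 and 6) and one at the very end of the string (index 9).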