## Importing and Using the `re` Module

To start, we need to import the `re` module.

In [None]:
import re

The `re.search()` function allows us to see if a string contains a particular pattern in which we're interested. The general syntax is `re.search(<pattern>,<string>)`.

In [None]:
re.search(r"ttt","aggtttcctttagttt")

If the pattern is found in the string, `re.search()` will return an `re.Match` object. If it's not found, `re.search()` will return the value `None`. This behavior allows us to use `re.search()` to execute logical tests.

In [None]:
if re.search(r"ttt","aggtttcctttagttt"):
    print("Pattern found!")
else:
    print("Pattern not found!")
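The `re.Match` object carries more than a yes/no answer. As a minimal sketch, the cell below reuses the same sequence to show the matched text and its position; the variable names `match` and `missing` are illustrative only.

```python
import re

# Search the same sequence used above and keep the Match object.
match = re.search(r"ttt", "aggtttcctttagttt")

# A Match object is truthy and None is falsy, which is why the if test works.
print(bool(match))

# The Match object records what matched and where.
# Positions are 0-based and the end index is exclusive.
print(match.group())   # the matched text
print(match.start())   # index where the match begins
print(match.end())     # index just past the end of the match
print(match.span())    # (start, end) as a tuple

# A pattern that is absent from the string gives None instead.
missing = re.search(r"ccc", "aggtttcctttagttt")
print(missing)
```

Only the first occurrence of `ttt` is reported here, even though the sequence contains three.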

The `re.sub()` function allows us to do find-and-replace operations. The general syntax is `re.sub(<pattern>,<replacement>,<string>)`.

In [None]:
re.sub(r"ttt","TTT","aggtttcctttagttt")
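By default, `re.sub()` replaces every non-overlapping match. The sketch below contrasts that with the optional `count` argument, which limits the number of replacements; the variable names are illustrative only.

```python
import re

seq = "aggtttcctttagttt"

# By default, every non-overlapping match is replaced.
replaced_all = re.sub(r"ttt", "TTT", seq)
print(replaced_all)

# The optional count argument caps the number of replacements,
# working from left to right.
replaced_first = re.sub(r"ttt", "TTT", seq, count=1)
print(replaced_first)

# re.sub() returns a new string; the original is left unchanged.
print(seq)
```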

Often, we want to specify patterns that are more flexible. The `.` wildcard matches any single character (except a newline), so in this example we are looking for a g, then any 3 characters, then a c.

In [None]:
match = re.search(r"g...c","aggtttcctttagttt")

Let's take a look at what part of our DNA sequence matched our pattern.

In [None]:
match.group()

Now, try this example.

In [None]:
match = re.search(r"g...c","attcgaagcaggtttcct")
match.group()

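The pattern `gtttc` is still present in this sequence. Why did this search return `gaagc`? `re.search()` scans from left to right and reports only the first match it finds. To retrieve every non-overlapping match as a list, use `re.findall()`.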

In [None]:
allMatches = re.findall(r"g...c","attcgaagcaggtttcct")
print(allMatches)
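To see where each match sits in the sequence, `re.finditer()` can be used in place of `re.findall()`. It yields one `re.Match` object per match instead of a plain string. This is a small sketch; the names `seq`, `first`, and `spans` are illustrative.

```python
import re

seq = "attcgaagcaggtttcct"

# re.search() stops at the first (leftmost) match.
first = re.search(r"g...c", seq)
print(first.group(), first.span())

# re.finditer() yields a Match object for each non-overlapping match,
# in left-to-right order, so positions are available as well as text.
spans = []
for m in re.finditer(r"g...c", seq):
    print(m.group(), m.span())
    spans.append(m.span())
```

The first match starts at index 4, ahead of `gtttc` at index 11, which is why `re.search()` returned `gaagc`.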

## Special Characters and Wildcards

Here are some examples using other special and wildcard characters. Before you run each example, predict what the output will be.

In [None]:
# \w matches 'word' characters (letters, digits, and underscore)
match = re.search(r"\w\w","t s m9")
match.group()

In [None]:
# \d matches digits
match = re.search(r"\d","t s m9")
match.group()

In [None]:
# \s matches any whitespace character
match = re.search(r"\s\w\d","t s m9")
match.group()

In [None]:
# ^ matches the beginning of the string
match = re.search(r"^\w\s\w","t s m9")
match.group()

In [None]:
# $ matches the end of the string
match = re.search(r"\w$","t s m9")
match.group()

In [None]:
# \S matches any non-whitespace character
match = re.search(r"\s\S\s","t s m9")
match.group()

In [None]:
# \ can be used to escape special characters
match = re.search(r"\$\d","I have $1.")
match.group()
# What happens if you remove the \ before $?
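When a pattern does not match, `re.search()` returns `None`, and calling `.group()` on `None` raises an `AttributeError`. The sketch below compares the escaped and unescaped patterns and shows one way to guard against a failed search; the names `escaped`, `unescaped`, and `result` are illustrative.

```python
import re

text = "I have $1."

# Escaped: \$ matches a literal dollar sign.
escaped = re.search(r"\$\d", text)
print(escaped.group())

# Unescaped: $ means "end of string", so this pattern asks for a digit
# after the end of the string, which cannot match here.
unescaped = re.search(r"$\d", text)
print(unescaped)

# Check for None before calling .group() to avoid an AttributeError.
if unescaped is None:
    result = "no match"
else:
    result = unescaped.group()
print(result)
```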

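You can create custom wildcards by listing the potentially matching characters inside square brackets. List the characters with no separators between them: a comma inside the brackets is treated as one more character to match.

In [None]:
# Searching for an even digit followed by an odd digit
match = re.search(r"[2468][13579]","1249787")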
match.group()
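Square brackets support a few more forms that are handy with sequence data. As a rough sketch (the test strings here are made up for illustration), the cell below shows a plain set, a negated set, a range, and the effect of a comma inside the brackets.

```python
import re

seq = "aggtNtcctttagttt"

# [acgt] matches any one of the listed characters.
runs = re.findall(r"[acgt]+", seq)
print(runs)

# A leading ^ inside the brackets negates the set:
# this matches any character that is NOT a, c, g, or t.
unexpected = re.findall(r"[^acgt]", seq)
print(unexpected)

# A hyphen between two characters defines a range.
ranged = re.search(r"[0-9][a-f]", "x7c")
print(ranged.group())

# Commas are not separators inside brackets; they match literally.
comma = re.search(r"[2,4]", "1,3")
print(comma.group())
```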

## Repetition

In [None]:
# + matches one or more instances of a character
match = re.search(r"A+","GCTTTGGAAAGG")
match.group()

In [None]:
# {} can specify a particular number of repetitions
match = re.search(r"A{2}","GCTTTGGAAAGG")
match.group()

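Compare the outputs of the two cells above. `A+` matched all three As (`AAA`), while `A{2}` matched exactly two (`AA`). Repetition characters such as `+`, `*`, and `?` are greedy: they match as many characters as they can. This is why a pattern like `C.+G` extends all the way to the last possible G in a string instead of stopping at the first one.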

In [None]:
# * matches zero or more instances of a character
matchOne = re.search(r"CT*G","GCTTTGGAAAGG")
matchOne.group()

In [None]:
matchTwo = re.search(r"GT*G","GCTTTGGAAAGG")
matchTwo.group()

In [None]:
# ? matches 0 or 1 instances of a character
matchOne = re.search(r"GC?T","GCTTTGGAAAGG")
matchOne.group()

In [None]:
matchTwo = re.search(r"TC?T","GCTTTGGAAAGG")
matchTwo.group()

Repetition characters can also be used in combination with wildcards.

In [None]:
# Matches any pattern with one or more characters between
# a C and a G
match = re.search(r"C.+G","GCTTTGGAAAGG")
match.group()
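Because `+` is greedy, `C.+G` runs to the last G in the sequence. Adding a `?` after a repetition character makes it non-greedy, so it stops at the earliest point that still satisfies the pattern. A minimal comparison, with illustrative variable names:

```python
import re

seq = "GCTTTGGAAAGG"

# Greedy: .+ consumes as much as possible, then backs off
# only as far as needed to find a final G.
greedy = re.search(r"C.+G", seq)
print(greedy.group())

# Non-greedy: .+? consumes as little as possible,
# so the match ends at the first G that follows at least one character.
lazy = re.search(r"C.+?G", seq)
print(lazy.group())
```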

## Capturing Text from Searches

To capture specific characters embedded in our larger search pattern, we can wrap them in parentheses.

In [None]:
# Capturing the two nucleotides just before and just after
# a stretch of 2 or more As.
match = re.search(r"(\w\w)AA+(\w\w)","GCTTTGCAAAAAGG")
match.group()

In [None]:
# Looking at the captured text
match.groups()
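Captured sections can also be retrieved one at a time by passing a number to `.group()`. Group 0 is the whole match, and the captured sections are numbered from 1 in the order their opening parentheses appear. A short sketch using the same search:

```python
import re

match = re.search(r"(\w\w)AA+(\w\w)", "GCTTTGCAAAAAGG")

# group(0) is the same as group(): the entire matched text.
print(match.group(0))

# group(1) and group(2) return the individual captured sections.
print(match.group(1))
print(match.group(2))

# span() accepts a group number too, giving that section's position.
print(match.span(1))
print(match.span(2))
```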

In [None]:
# Capturing text can also be used with re.findall()
matches = re.findall(r"(\w\w)AA+(\w\w)","GCTTTGCAAAAAGGTCTTAGAAAAATG")
print(matches)

In [None]:
# Looking at just the captured text from the 2nd match
print(matches[1])

## Substitution with Captured Text

We can use text captured during a search as part of a substitution. Here, we're adding vertical bars at the edges of our stretches of As. `\1` indicates the first section of captured text, `\2` the second section, etc.

In [None]:
re.sub(r"(\w\w)(AA+)(\w\w)",r"\1|\2|\3","GCTTTGCAAAAAGGTCTTAGAAAAATG")
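The replacement string is written as a raw string (`r"..."`) so that Python passes `\1` through to `re` unchanged. The sketch below stores the result and also shows the equivalent `\g<n>` form, which is useful when a captured section must be followed directly by a digit. Variable names are illustrative.

```python
import re

seq = "GCTTTGCAAAAAGGTCTTAGAAAAATG"
pattern = r"(\w\w)(AA+)(\w\w)"

# \1, \2, and \3 insert the three captured sections,
# with a vertical bar placed on each side of the run of As.
marked = re.sub(pattern, r"\1|\2|\3", seq)
print(marked)

# \g<1> is another way to write \1. It avoids ambiguity when the
# next character in the replacement is a digit.
marked_g = re.sub(pattern, r"\g<1>|\g<2>|\g<3>", seq)
print(marked_g)
```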