# BMI565: Bioinformatics Programming & Scripting

#### (C) Michael Mooney (mooneymi@ohsu.edu)

## Week 2: Regular Expressions

1. What are Regular Expressions?
    - Special Characters
2. Matching and Modifying Strings
    - The `re` Module
    - The Match Object
    - Sets and Groups
    - Compiling RegEx Patterns
3. PROSITE Patterns

#### Requirements
- Python 2.7 or 3.x
- Miscellaneous Files
    - `./images/central_dogma.jpg`
    - `./images/codon_table.jpg`

In [1]:
from __future__ import print_function, division

## What are Regular Expressions?

Regular expressions are patterns containing text and special characters. They provide a concise and flexible way to match strings, and are used in many computer science applications:

- Bioinformatics Applications
    - Sequence Analysis: Identifying promoter sequences, DNA binding sites, gene structure
- Other Applications
    - Information Retrieval: Parsing and searching text
    - Natural Language Processing
    - SQL
        
Let's say we want to search a string called `text`:

    text = "hello DNA!! ATTTTCGCGCAT"

### RE Special Characters
<table align="left">
<tr><td style="text-align:center"><b>Pattern</b></td><td><b>Description</b></td><td><b>Example</b></td><td><b>Matches</b></td></tr>
<tr><td style="text-align:center">`simple text`</td><td>Matches exact text</td><td style="text-align:center">`"DNA"`</td><td style="text-align:center">`["DNA"]`</td></tr>
<tr><td style="text-align:center">`.`</td><td>Matches any single character (except a newline)</td><td style="text-align:center">`"h."`</td><td style="text-align:center">`["he"]`</td></tr>
<tr><td style="text-align:center">`^`</td><td>Matches the beginning of the string</td><td style="text-align:center">`"^hello"`</td><td style="text-align:center">`["hello"]`</td></tr>
<tr><td style="text-align:center">`$`</td><td>Matches the end of a string</td><td style="text-align:center">`"CAT$"`</td><td style="text-align:center">`["CAT"]`</td></tr>
<tr><td style="text-align:center">`*`</td><td>Matches zero or more of the preceding pattern</td><td style="text-align:center">`"AT*"`</td><td style="text-align:center">`["A", "ATTTT", "AT"]`</td></tr>
<tr><td style="text-align:center">`+`</td><td>Matches one or more of the preceding pattern</td><td style="text-align:center">`"G+"`</td><td style="text-align:center">`["G", "G"]`</td></tr>
<tr><td style="text-align:center">`?`</td><td>Matches zero or one of the preceding pattern</td><td style="text-align:center">`"CG?"`</td><td style="text-align:center">`["CG", "CG", "C"]`</td></tr>
</table>
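
The table entries can be checked directly with `re.findall()`, which returns every non-overlapping match. The following is a minimal sketch, assuming the same `text` string as above; the `patterns` list and `results` dictionary are names introduced here for illustration only.

```python
import re

text = "hello DNA!! ATTTTCGCGCAT"

# Each example pattern from the table, applied to the same text
patterns = [r"DNA", r"h.", r"^hello", r"CAT$", r"AT*", r"G+", r"CG?"]

results = {}
for p in patterns:
    # re.findall() returns a list of all non-overlapping matches
    results[p] = re.findall(p, text)
    print(p, "->", results[p])
```

Note that `"AT*"` also matches the lone `A` in `DNA`, because `*` allows zero repetitions of `T`.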

## Matching and Modifying Strings

In [2]:
## Import the re module
import re

## Define the text to search
text = "hello DNA!! ATTTTCGCGCAT"

## The re.findall() function will return a list of all matches
re.findall(r"AT+", text)

['ATTTT', 'AT']

In [3]:
## The re.search() function will return a match object
## for the first match only
mo = re.search(r"AT*", text)
mo

<re.Match object; span=(8, 9), match='A'>

In [4]:
## Match objects contain multiple pieces of info about the match
print(mo.group())
print(mo.span())
print(mo.start())
print(mo.end())

A
(8, 9)
8
9


In [5]:
## The re.finditer() function will return an iterator
## with a match object for each match
mo_iter = re.finditer(r"AT*", text)
for m in mo_iter:
    print(m.group(), "starts at index", m.start())

A starts at index 8
ATTTT starts at index 12
AT starts at index 22


In [6]:
## The re.split() function splits a string at each match
re.split(r"!!", text)

['hello DNA', ' ATTTTCGCGCAT']

In [7]:
## The re.sub() function replaces each match with a new string
re.sub(r"!!", "??", text)

'hello DNA?? ATTTTCGCGCAT'

In [8]:
text2 = re.sub(r"!!", "??", text)
text2

'hello DNA?? ATTTTCGCGCAT'

In [9]:
## The re.subn() function returns a tuple containing the 
## updated string and the number of replacements
re.subn(r"T", "t", text)

('hello DNA!! AttttCGCGCAt', 5)

### Repetitions, Sets, and Groups

    text = "hello DNA!! ATTTTCGCGCAT"

<table align="left">
<tr><td style="text-align:center"><b>Pattern</b></td><td><b>Description</b></td><td><b>Example</b></td><td><b>Matches</b></td></tr>
<tr><td style="text-align:center">`{m}`</td><td>Matches exactly `m` repetitions of the preceding pattern</td><td style="text-align:center">`"T{2}"`</td><td style="text-align:center">`["TT", "TT"]`</td></tr>
<tr><td style="text-align:center">`{m,n}`</td><td>Matches from `m` to `n` repetitions of the preceding pattern</td><td style="text-align:center">`"T{1,3}"`</td><td style="text-align:center">`["TTT", "T", "T"]`</td></tr>
<tr><td style="text-align:center">`[]`</td><td>Matches any of a set of characters (character ranges can be used)</td><td style="text-align:center">`"[hDN]"`</td><td style="text-align:center">`["h", "D", "N"]`</td></tr>
<tr><td style="text-align:center">`[^]`</td><td>Matches any character other than those in the set specified</td><td style="text-align:center">`"[^a-zA-Z ]"`</td><td style="text-align:center">`["!", "!"]`</td></tr>
<tr><td style="text-align:center">`A|B`</td><td>Matches pattern `A` or pattern `B`</td><td style="text-align:center">`"l{2}|AT+"`</td><td style="text-align:center">`["ll", "ATTTT", "AT"]`</td></tr>
<tr><td style="text-align:center">`()`</td><td>Parentheses are used to group a pattern into a single entity</td><td style="text-align:center">`"(CG)+"`</td><td style="text-align:center">`["CGCG"]`</td></tr>
</table>

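\*Note: Within `[]`, most special characters (e.g. `.`, `+`, `*`) are interpreted literally. The exceptions are `^` at the start of the set (negation), `-` between two characters (a range), and `\` (escape).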
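
A few of these constructs can be tried on the same `text` string. This is a minimal sketch; the variable names are introduced here for illustration. It also shows a point that is easy to miss: when a pattern contains a capturing group, `re.findall()` returns the group rather than the whole match.

```python
import re

text = "hello DNA!! ATTTTCGCGCAT"

# Repetition ranges are greedy: up to three T's are consumed at a time
t_runs = re.findall(r"T{1,3}", text)
print(t_runs)       # ['TTT', 'T', 'T']

# A character set, and a negated set that also excludes the space
in_set = re.findall(r"[hDN]", text)
not_set = re.findall(r"[^a-zA-Z ]", text)
print(in_set)       # ['h', 'D', 'N']
print(not_set)      # ['!', '!']

# With a capturing group, re.findall() returns the group's last
# repetition, while the match object's group() gives the whole match
group_only = re.findall(r"(CG)+", text)
whole = re.search(r"(CG)+", text).group()
print(group_only)   # ['CG']
print(whole)        # CGCG

# Inside [] the characters ., + and * are literal
literal = re.findall(r"[.+*]", "a.b+c*d")
print(literal)      # ['.', '+', '*']
```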

#### More Examples

In [10]:
re.findall(r"[^a-zA-Z ]", text)

['!', '!']

In [11]:
## Use groups to find repetitions
seq = "ATAATAAGATGCGCGCGCGCGCTTATGCGCGCGCA"
seq_iter = re.finditer(r"(AT)(GC)+", seq)
for x in seq_iter:
    print(x.group())
    print(x.span())

ATGCGCGCGCGCGC
(8, 22)
ATGCGCGCGC
(24, 34)


In [12]:
## You can access the individual groups by number.
## 0 is the default and will return the entire match
## Individual groups can be accessed starting with 1
seq = "ATAATAAGATGCGCGCGCGCGCTTATGCGCGCGCA"
seq_iter = re.finditer(r"(AT)((GC)+)", seq)
for x in seq_iter:
    print(x.group(1))
    print(x.span(1))

AT
(8, 10)
AT
(24, 26)


### Capturing and Labeling Groups

Groups are captured so that you can refer to them later, both within the pattern itself and in the match result (as seen above). To reference a group within the pattern, use a `\` followed by the number of the group (e.g. `\1` or `\2`). Groups can also be labeled so they can be referenced by name. To label a group, insert `?P<group_name>` inside the parentheses just before the group pattern. You can also specify that a group NOT be captured by placing `?:` inside the parentheses before the group pattern.

\*Note: Remember that `\` is an escape character. If you don't use a raw string (`r""`), you'll have to use a double backslash `\\`.
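
The sketch below illustrates non-capturing groups, named groups, and the raw-string note, using the `seq` string from the earlier examples. The `(?P=name)` syntax for referring back to a named group within a pattern is part of Python's `re` module but is not covered in the text above; the group name `base` and the variable names are chosen here for illustration.

```python
import re

seq = "ATAATAAGATGCGCGCGCGCGCTTATGCGCGCGCA"

# (?:GC)+ groups the repeat without capturing it, so re.findall()
# returns only the single captured group (the full run of GC repeats)
gc_runs = re.findall(r"AT((?:GC)+)", seq)
print(gc_runs)              # ['GCGCGCGCGCGC', 'GCGCGCGC']

# A labeled group can be referenced later in the pattern with (?P=name)
named = re.search(r"(?P<base>A|C)G(?P=base)", seq)
print(named.group())        # AGA
print(named.group("base"))  # A

# Without a raw string, the backslash in \1 must be doubled
same = (r"(A|C)G\1" == "(A|C)G\\1")
print(same)                 # True
```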

In [13]:
## When a pattern contains groups, re.findall() will
## only return the captured groups
re.findall(r"(A|C)(G)(\1)", seq)

[('A', 'G', 'A'),
 ('C', 'G', 'C'),
 ('C', 'G', 'C'),
 ('C', 'G', 'C'),
 ('C', 'G', 'C'),
 ('C', 'G', 'C')]

In [14]:
seq

'ATAATAAGATGCGCGCGCGCGCTTATGCGCGCGCA'

In [15]:
seq_iter2 = re.finditer(r"(A|C)(G)\1", seq)
for x in seq_iter2:
    print(x.group())

AGA
CGC
CGC
CGC
CGC
CGC


In [16]:
## Match an email address using labeled groups
email = "user@domain.edu"
mo2 = re.search(r"(?P<username>[a-zA-Z0-9_.]+)@(?P<domain>[a-zA-Z0-9_.]+\.(com|edu|org))", email)
print(mo2.group())
print(mo2.group("username"))
print(mo2.group("domain"))

user@domain.edu
user
domain.edu


In [17]:
mo2.group(2)

'domain.edu'

In [18]:
mo2.group(3)

'edu'

### Compiling RegEx Patterns

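A regex pattern can be compiled into a pattern object with `re.compile()`. Compiling once and reusing the object avoids re-processing the pattern on every call, which helps when the same pattern is applied to a large amount of text or to many strings. The pattern object provides the same functions as the `re` module (e.g. `findall()`, `search()`, `sub()`) as methods.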



In [19]:
## Use the re.compile() function to compile a pattern
rgx = re.compile(r"[Hh]ello")
text2 = "Hello, world. hello, Python!"
rgx.findall(text2)

['Hello', 'hello']
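
A compiled pattern is most useful when it is reused across many strings. The following is a minimal sketch under that assumption; the sequences in `seqs` and the names `at_gc` and `hits` are hypothetical and used only for illustration.

```python
import re

# Hypothetical sequences, for illustration only
seqs = ["ATGCGCTT", "TTTTAAAA", "GGATGCAT"]

# Compile once, then reuse the pattern object for every sequence
at_gc = re.compile(r"AT(GC)+")

hits = []
for s in seqs:
    m = at_gc.search(s)
    if m:  # search() returns None when there is no match
        hits.append((s, m.span()))

print(hits)  # [('ATGCGCTT', (0, 6)), ('GGATGCAT', (2, 6))]
```

Checking the result of `search()` before calling methods on it avoids an `AttributeError` for sequences that do not contain the pattern.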

## PROSITE Patterns

### The Central Dogma of Biology

<img src="./images/central_dogma.jpg" width="400" height="400" align="left" />

1. DNA is transcribed to messenger RNA (mRNA)
2. mRNA is translated to amino acids
3. Amino acids are the building blocks of proteins

### RNA Codon Table

The following table shows the standard genetic code: mRNA codons (triplets) and the amino acids they specify. Also shown are the three-letter codes and IUPAC single-letter codes for each amino acid.

<img src="./images/codon_table.jpg" height="500" align="left" />

#### Types of Protein Structure

1. Primary Structure: amino acid (AA) sequence
2. Secondary Structure: AA sequence interacts with itself to form local structures (e.g. beta sheets, alpha helices)
3. Tertiary Structure: Secondary structures interact to form the 3D protein structure
4. Quaternary Structure: Protein complexes (multiple proteins interact)

### The PROSITE Database

- A collection of amino acid sequences associated with protein domains (e.g. structural and/or functional)
    - Useful for analysis of known sequences
    - A tool for analyzing function conservation across biological systems
    - Prediction of function of a novel sequence
- Patterns are similar to regular expressions, with the following syntax:
    - Every amino acid position is separated by a hyphen (`-`)
    - Amino acids are specified by their IUPAC codes ([http://www.bioinformatics.org/sms/iupac.html](http://www.bioinformatics.org/sms/iupac.html))
    - Ambiguous positions are specified by placing all acceptable AAs in `[]`
    - Ambiguous positions are also specified by placing unacceptable AAs in `{}`
    - An `x` is used if any AA is accepted
    - Repetitions are indicated by following an element with a numerical value (or a range for `x`) in parentheses, e.g. `x(3)` or `x(2,4)`
    - [http://prosite.expasy.org/scanprosite/scanprosite_doc.html](http://prosite.expasy.org/scanprosite/scanprosite_doc.html)
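
As a rough illustration of how these rules map onto regular expression syntax, the sketch below translates a made-up pattern by hand. The pattern `C-x(2,4)-[ST]-{P}-H` and the two peptide strings are hypothetical and are not taken from the PROSITE database.

```python
import re

# Hypothetical PROSITE-style pattern (not a real PROSITE entry):
#   C-x(2,4)-[ST]-{P}-H
# Element-by-element translation:
#   C       -> C        literal amino acid
#   x(2,4)  -> .{2,4}   any character, 2 to 4 times
#   [ST]    -> [ST]     S or T
#   {P}     -> [^P]     any character except P
#   H       -> H        literal amino acid
# The hyphens only separate positions and are dropped.
pattern = re.compile(r"C.{2,4}[ST][^P]H")

# Hypothetical peptide sequences
hit = pattern.search("MKCAASGHLL")
miss = pattern.search("MKCAASPHLL")

print(hit.group())  # CAASGH
print(miss)         # None
```

Here `.` accepts any character, not only valid amino acid codes; a stricter translation could use a character set of the IUPAC letters instead.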

#### An Example PROSITE Pattern:

    K(2)-{KR}-C-G-H-[LMQR]

## In-Class Exercises

In [None]:
## Exercise 1.
## K(2)-{KR}-C-G-H-[LMQR]
## Using the rules mentioned above, translate this PROSITE Pattern to a RegEx


In [None]:
## Exercise 2.
## seq_ex2 = "ATTTTATAACGATGCGGGCGCG"
## Write a regular expression to search for cases 
## where there are repeat bases
## "AA", "TTTT", "GGG", ...
seq_ex2 = "ATTTTATAACGATGCGGGCGCG"


In [None]:
## Exercise 3.
## Write a regular expression to search for an 'A' or 'T' 
## preceded by "CC" or "GG" or "CG" or "GC"
## "CGA", "CCT", "GCT", ...


## References

- <u>Python for Bioinformatics</u>, Sebastian Bassi, CRC Press (2010)
- <u>Python Essential Reference</u>, David Beazley, 4th Edition, Addison-Wesley (2008)
- [http://docs.python.org/](http://docs.python.org/)
- [http://www.regular-expressions.info/](http://www.regular-expressions.info/)

#### Last Updated: 15-Sep-2022