<h1 id="toctitle">Introduction to Python 9 | Regular expressions</h1>
<ul id="toc"/>

##What are regular expressions?

Regular expressions (aka _regex_) are a special mini-language for describing patterns in strings.

They are handy for working with patterns in DNA and protein sequences, and with text files of many different types.

They also crop up in other tools, such as text editors and grep.

###Regular expression module

The tools for using regular expressions live in the `re` module. We have to `import` the module, then use the module name when running functions:

In [3]:
import re

re.search('a', 'abc')

search('a', 'abc')

NameError: name 'search' is not defined

##Raw strings 

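Prefix a string with `r` to turn it into a raw string. In a raw string, backslash sequences such as `\n` (newline) and `\t` (tab) are not interpreted; the backslash is kept as an ordinary character: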

In [5]:
my_string = "hello\nworld\t!"
print(my_string)

my_raw_string = r"hello\nworld\t!"
print(my_raw_string)

hello
world	!
hello\nworld\t!
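
Comparing string lengths is a quick way to see what the `r` prefix changes. This is only a small sketch, and the variable names are made up for illustration:

```python
normal = "a\tb"   # \t is interpreted as one tab character
raw = r"a\tb"     # the backslash and the t stay as two separate characters

print(len(normal))  # 3
print(len(raw))     # 4
```

Regular expressions use backslashes for their own purposes, which is why patterns are usually written as raw strings.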


##Searching for patterns with variation

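`re.search()` takes two arguments: a pattern and a string. It searches for the pattern in the string and returns a match object if the pattern is found, or `None` if it is not. A match object counts as true and `None` counts as false, so we can use the result directly in an `if` statement.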

Here is a boring pattern:

In [6]:
dna = "ATCGCGAATTCAC"

if re.search(r"GAATTC", dna):
    print("restriction site found!")
else:
    print("no restriction site!")

restriction site found!


Note that we don't really need a raw string in the example above, but it's a good habit.

###Alternation

Here's an example of a pattern that's a bit more interesting. When there are two different possibilities we surround them with parentheses and separate with pipe characters:

In [7]:
dna = "ATCGCGAATTCAC"

if re.search(r"GG(A|T)CC", dna):
    print("restriction site found!")
else:
    print("no restriction site!")


no restriction site!
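
To see the alternation actually match, we can try the same pattern against a few short sequences. The sequences below are made up for illustration:

```python
import re

pattern = r"GG(A|T)CC"
# made-up test sequences: the first two contain the site, the third does not
sequences = ["ATGGACCTA", "ATGGTCCTA", "ATGGGCCTA"]

alternation_results = []
for seq in sequences:
    found = re.search(pattern, seq) is not None
    alternation_results.append(found)
    print(seq, found)
```

The third sequence fails because the base between `GG` and `CC` has to be either A or T.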


###Character groups

A very common type of alternation is when we want to allow any one of a list of characters. We can write it like this:

In [9]:
dna = "ATCGCGAATTCAC"
if re.search(r"GC(A|T|G|C)GC", dna):
    print("restriction site found!")

or with a shorthand like this:

In [10]:
if re.search(r"GC[ATGC]GC", dna):
    print("restriction site found!")

Sometimes it's easier to describe a character group by listing the characters that are __not__ allowed. Special rule: if the character group starts with ^ then it means any character except these ones:

In [23]:
dna = "ATCGCGYAATTCAC"

if re.search(r"[^ATGC]", dna):
    print("ambiguous base found!")

ambiguous base found!


There are useful shortcuts for some commonly-used character groups. A full stop (aka dot, period) stands for any single character (except, by default, a newline).
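
A small sketch of the dot in action, using made-up sequences. The dot has to match exactly one character, so the last sequence, with nothing between the two `GC` pairs, does not match:

```python
import re

pattern = r"GC.GC"
# made-up test sequences
sequences = ["ATGCAGCTT", "ATGCNGCTT", "ATGCGCTT"]

# map each sequence to True/False depending on whether the pattern was found
dot_results = {seq: bool(re.search(pattern, seq)) for seq in sequences}
print(dot_results)
```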

###Quantifiers

Another type of variation is the number of times something is repeated.

A question mark means the thing preceding it is optional. In the pattern `GCG?Y` the second G is optional. The pattern will match `GCGY` and `GCY`. 

A plus means that the thing preceding it must appear at least once, and can be repeated any number of times. In the pattern `GCG+Y` the second G can be repeated, so it matches `GCGY`, `GCGGY`, `GCGGGY`, etc. but __not__ `GCY`.

An asterisk is the most flexible quantifier; the thing preceding it is optional, but can also be repeated. The pattern `GCG*Y` will match `GCY`, `GCGY`, `GCGGY`, `GCGGGY`, etc. 

For more specificity, we can specify a minimum and maximum number of repetitions:

`GCG{2,4}Y` will match `GCGGY`, `GCGGGY` and `GCGGGGY` but __not__ `GCGY` or `GCGGGGGY`, etc. 
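
One way to check these rules is to run all four patterns over the same short list of candidate strings. The sketch below uses `re.fullmatch()`, which is not used elsewhere in these notes: it is available in Python 3.4 and later and only succeeds when the whole string matches the pattern. The candidate strings are made up:

```python
import re

patterns = [r"GCG?Y", r"GCG+Y", r"GCG*Y", r"GCG{2,4}Y"]
# made-up candidates with zero, one, two and five Gs after the GC
candidates = ["GCY", "GCGY", "GCGGY", "GCGGGGGY"]

quantifier_results = {}
for pattern in patterns:
    # keep only the candidates that the pattern matches in full
    matched = [s for s in candidates if re.fullmatch(pattern, s)]
    quantifier_results[pattern] = matched
    print(pattern, matched)
```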


###Positions

Unlike all the stuff above, __positions__ don't describe characters; they specify where in the string the pattern has to match.

`^` means the start (don't get confused: it also means a negated character group as described earlier). So `^G` will match `GATC` but not `ATGC`. 

`$` means the end, so the pattern `G$` will match `ATCG` but not `AGTC`.
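
The four cases just described can be checked directly. This is just a sketch; the list of tests is only there to keep the code short:

```python
import re

# (pattern, string) pairs taken from the examples in the text
tests = [(r"^G", "GATC"), (r"^G", "ATGC"), (r"G$", "ATCG"), (r"G$", "AGTC")]

position_results = []
for pattern, seq in tests:
    found = bool(re.search(pattern, seq))
    position_results.append(found)
    print(pattern, seq, found)
```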

###Combinations

The real power of regular expressions comes from combining all these features. Here's a complex regular expression that describes a full length messenger RNA with start codon and polyA tail:

`^ATG[ATGC]{30,1000}A{5,10}$`

Look at the features:

- string must start with ATG
- then between 30 and 1000 bases that must be A/T/G/C
- string must end with between 5 and 10 consecutive As
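
To try this pattern out we can build a few test sequences by hand. The sequences below are invented for illustration: one that satisfies all three rules and three that each break one of them.

```python
import re

mrna_pattern = r"^ATG[ATGC]{30,1000}A{5,10}$"

# made-up test sequences
good = "ATG" + "ACGT" * 10 + "A" * 6       # start codon, 40 bases, polyA tail
no_start = "CTG" + "ACGT" * 10 + "A" * 6   # does not start with ATG
too_short = "ATG" + "ACGT" * 5 + "A" * 6   # too few bases after the ATG
no_tail = "ATG" + "ACGT" * 10 + "C"        # does not end with a run of As

mrna_results = {}
for name, seq in [("good", good), ("no_start", no_start),
                  ("too_short", too_short), ("no_tail", no_tail)]:
    mrna_results[name] = bool(re.search(mrna_pattern, seq))
    print(name, mrna_results[name])
```

Only the first sequence should match.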



##Other stuff we can do with regular expressions

###Extracting the match

In an `if` statement, `re.search()` behaves like a true/false function, but what it actually returns is a match object (or `None` when there is no match). We can store that match object in a variable and call its methods to get information about the match. For example, take our non-ATGC base example:

In [12]:
dna = "ATCGCGYAATTCAC"

if re.search(r"[^ATGC]", dna):
    print("ambiguous base found!")

ambiguous base found!


We know that we found a non-ATGC base, but what was it? Calling `group()` on the match object will tell us:

In [15]:
dna = "CGATCGGAAYCGATC"
m = re.search(r"[^ATGC]", dna)

# m is now a match object
if m:
    print("ambiguous base found!")
    ambig = m.group()
    print("the base is " + ambig)

ambiguous base found!
the base is Y


###Getting the position

Another thing we can do is get the position of the match with `start()` (also `end()`):

In [16]:
dna = "CGATCGGAAYCGATC"
m = re.search(r"[^ATGC]", dna)

# m is now a match object
if m:
    print("ambiguous base found!")
    ambig = m.group()
    print("the base is " + ambig)
    print("the base is at position " + str(m.start()))

ambiguous base found!
the base is Y
the base is at position 9
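
`start()` and `end()` follow the same convention as string slices: `start()` is the index of the first matched character and `end()` is one past the last. A short sketch showing that slicing with them gives back the matched text:

```python
import re

dna = "CGATCGGAAYCGATC"
m = re.search(r"[^ATGC]", dna)

if m:
    print(m.start(), m.end())        # 9 10
    # slicing the original string with start/end recovers the match
    matched_text = dna[m.start():m.end()]
    print(matched_text)              # Y
```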


###Splitting a string with a regex

`re.split()` works just like regular `split()`, but takes a regular expression pattern as the separator. Here we split a DNA sequence whenever we see a non-ATGC base. Note pattern reuse!

In [17]:
dna = "ACTNGCATRGCTACGTYACGATSCGAWTCG"

runs = re.split(r"[^ATGC]", dna)

print(runs)

['ACT', 'GCAT', 'GCTACGT', 'ACGAT', 'CGA', 'TCG']


The output is a list of strings. 

Above, we exclude the bits of the string that matched the pattern and just keep the non-matching bits. For the opposite, use `re.findall()`. E.g. find all runs of A/T that are at least four bases long:

In [20]:
dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"

runs = re.findall(r"[AT]{4,}", dna)

print(runs)

['ATTATAT', 'AAATTATA']
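
The two functions complement each other. As a sketch, running `re.findall()` with the non-ATGC pattern on the sequence from the `re.split()` example returns exactly the separators that `re.split()` threw away:

```python
import re

dna = "ACTNGCATRGCTACGTYACGATSCGAWTCG"

# the ambiguous bases themselves, in the order they appear
ambiguous = re.findall(r"[^ATGC]", dna)
print(ambiguous)

# the runs of unambiguous sequence between them
runs = re.split(r"[^ATGC]", dna)
print(runs)
```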


###Finding multiple matches

Some problems require complete match objects for multiple matches; for these, use `re.finditer()`. E.g. using the same sequence and a similar pattern, what are the start/stop positions of all runs of A/T?

In [22]:
dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"
runs = re.finditer(r"[AT]{3,100}", dna)
for match in runs:
    run_start = match.start()
    run_end = match.end()
    print(run_start, run_end)

5 12
18 26
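
Because each item from `re.finditer()` is a full match object, we can combine the position methods with `group()`. A minimal sketch that collects the start, end and text of each run, using the `{4,}` pattern from the `re.findall()` example:

```python
import re

dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"

run_info = []
for match in re.finditer(r"[AT]{4,}", dna):
    run = match.group()
    run_info.append((match.start(), match.end(), run))
    print(match.start(), match.end(), run, len(run))
```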


##Exercises

###Accession numbers, again

Here's a list of made-up gene accession numbers:

`xkn59438, yhdck2, eihd39d9, chdsye847, hedle3455, xjhd53e, 45da, de37dp`

Write a program that will print only the accession names that satisfy the following criteria – treat each criterion separately:

- contain the number 5
- contain the letter d or e
- contain the letters d and e in that order
- contain the letters d and e in that order with a single letter between them
- contain both the letters d and e in any order
- start with x or y
- start with x or y and end with e
- contain three or more numbers in a row
- end with d followed by either a, r or p

###Double digest

Look at the file _long_dna.txt_ which contains a made-up DNA sequence. 

Predict the fragment lengths that we will get if we digest the sequence with a made-up restriction enzyme __AbcI__, whose recognition site is `ANT*AAT` (easy).

What will the fragment lengths be if we do a double digest with both __AbcI__ and __AbcII__, whose recognition site is `GCRW*TG` (hard)? Can you predict the sequences of the fragments themselves?

(asterisks indicate the position of the cut site)
