# Lab 4: Regular expressions

## Learning Objectives

* Regular Expressions


## 4.1 Regular expressions

The Python string methods are great for ease of use. If what you're trying to do can be accomplished with string functions, you should use them. They're fast and simple and easy to read, and there's a lot to be said for fast, simple, readable code.  

However, string methods are limited to the simplest of search and replace cases. The search methods look for a single, hard-coded substring, and they are always case-sensitive. To do case-insensitive searches of a string s, you must call s.lower() or s.upper() and make sure your search strings are the appropriate case to match. The replace  methods have the same limitations.
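
As a small illustration of the limitation described above, the sketch below searches with plain string methods. The sequence is the one used later in this lab; the variable names are assumptions chosen for the example.

```python
# Sequence reused from the examples later in this lab
dna = 'AGTTGTAATGAGGCTGCCGTGATAaaaaaaa'

# str.find is case-sensitive: 'AAA' never occurs in upper case, so -1 is returned
exact_index = dna.find('AAA')

# Lower-casing both the text and the search string gives a case-insensitive search
folded_index = dna.lower().find('AAA'.lower())

print(exact_index)
print(folded_index)
```

This works for one fixed substring, but it cannot express a pattern such as "C, then any amino acid, then C", which is where regular expressions come in.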

A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. Regular expressions are widely used in the UNIX world. Many bioinformatics problems involve regular expressions, and we will spend a lot of time in this class building our knowledge of them through many use cases.

Python provides regular expression patterns through the re module. Modules in Python are simply files of Python code that are not loaded automatically when a program runs; you make one available with an import statement. Later we will use other math- and science-specific Python modules that have been written by experts in the field.

The re module has several different functions. There is a nice HOWTO in the Python docs (https://docs.python.org/3/howto/regex.html#compiling-regular-expressions), and the manual page on regular expressions is at https://docs.python.org/3/library/re.html?highlight=regular%20expression#.

### Search

This function searches for the first occurrence of the RE pattern within a string, with optional flags. The syntax for this function is:

re.search(pattern, string, flags=0)

* pattern - This is the regular expression to be matched.
* string - This is the string, which would be searched to match the pattern anywhere in the string.
* flags - Optional modifiers, such as re.I for case-insensitive matching. You can combine several flags using bitwise OR (|).
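
A minimal sketch of combining two flags with |, assuming a made-up two-line string. It uses re.I (ignore case) together with re.M, which lets ^ match at the start of every line rather than only at the start of the string.

```python
import re

# Made-up two-line text (an assumption, for illustration only)
text = 'ccc\natgaaa'

# With no flags, ^ anchors at the start of the string and matching is case-sensitive
no_flags = re.search(r'^ATG', text)

# re.I ignores case and re.M lets ^ match at the start of each line
both_flags = re.search(r'^ATG', text, re.I | re.M)

print(no_flags)
print(both_flags.group())
```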

The re.search function returns a match object on success and None on failure. Here is how it works:


In [1]:
# Example 4.1
# Name:re_search_raw_string.py
# Description:  This program demonstrates the use of regular expressions

# Import the regular expression module
import re

DNA = 'AGTTGTAATGAGGCTGCCGTGATAaaaaaaa'

# Find a pattern in the DNA sequence
# The r prefix makes the pattern a raw string
motif = re.search(r'TTG', DNA)

if motif :
   print (motif.group())
else :
   print ("No match!!")

# The flag re.I is set so that the search is case insensitive
motif = re.search(r'AAA', DNA, re.I)

if motif :
   print (motif.group())
else :
   print ("No match!!")

TTG
Aaa


The code motif = re.search(pat, str) stores the search result in a variable named "motif". Then the if-statement tests the match -- if true the search succeeded and motif.group() is the matching text (e.g. 'TTG'). Otherwise if the match is false (None to be more specific), then the search did not succeed, and there is no matching text.

The 'r' at the start of the pattern string designates a Python "raw" string, which passes backslashes through without change; this is very handy for regular expressions. I recommend that you always write pattern strings with the 'r' just as a habit. Some programmers use an alternative form in which the pattern is first compiled with re.compile.
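
To see why the raw string matters, here is a short sketch using \b, which means "word boundary" to the regular expression engine but "backspace" inside an ordinary Python string. The text being searched is a made-up example.

```python
import re

# In an ordinary string '\b' is converted to a single backspace character
plain = '\bATG\b'
# In a raw string the backslash is kept, so the regex engine sees \b (word boundary)
raw = r'\bATG\b'

text = 'start ATG stop'  # made-up text, for illustration only

print(len(plain), len(raw))
print(re.search(plain, text))
print(re.search(raw, text).group())
```

The ordinary string is two characters shorter and finds nothing, because it looks for literal backspace characters around ATG.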



In [2]:
# Example 4.2
# Name:re_search.py
# Description:  This program demonstrates the use of regular expressions

# Import the regular expression module
import re

DNA = 'AGTTGTAATGAGGCTGCCGTGATAaaaaaaa'

# Find a pattern in the DNA sequence
# re.compile turns the pattern into a reusable pattern object
pattern = re.compile('TTG')
motif = re.search(pattern, DNA)

if motif :
   print (motif.group())
else :
   print ("No match!!")

# The flag re.I is set so that the search is case insensitive
pattern = re.compile('AAA', re.I)
motif = re.search(pattern, DNA)

if motif :
   print (motif.group())
else :
   print ("No match!!")

TTG
Aaa


The start, the end, or both coordinates of a match can be found with these methods:
* start() 	Return the starting position of the match
* end() 	Return the ending position of the match
* span() 	Return a tuple containing the (start, end) positions of the match

In [3]:
# Example 4.3
# Name:re_search_coordinates.py
# Description:  This program demonstrates the use of regular expressions

# Import the regular expression module
import re

DNA = 'AGTTGTAATGAGGCTGCCGTGATAaaaaaaa'

# Find a pattern in the DNA sequence
# The r prefix makes the pattern a raw string
motif = re.search(r'TTG', DNA)

if motif :
    print (motif.group())
    print (motif.start())
    print (motif.end())
    print (motif.span())
    
else :
    print ("No match!!")

TTG
2
5
(2, 5)
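
The coordinates follow Python's slicing convention (the end position is one past the last matched character), so they can be used directly to cut the sequence. A minimal sketch, reusing the sequence and pattern from Example 4.3:

```python
import re

DNA = 'AGTTGTAATGAGGCTGCCGTGATAaaaaaaa'
motif = re.search(r'TTG', DNA)

if motif:
    start, end = motif.span()
    # Slicing with the span gives back the matched text
    print(DNA[start:end])
    # The length of the match is end minus start
    print(end - start)
    # The three bases immediately after the motif
    print(DNA[end:end + 3])
else:
    print("No match!!")
```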


### Findall 

findall() is probably the single most powerful function in the re module. Above we used re.search() to find the first match for a pattern. findall() finds *all* the matches and returns them as a list of strings, with each string representing one match. 

In [4]:
# Example 4.4
# Name:re_findall.py
# Description:  This program demonstrates the use of regular expressions

# Import the regular expression module
import re

DNA = 'AGTTGTAATGAGGCTGCCGTGATAaaaaaa'

# re.findall returns all non-overlapping matches as a list of strings

# Find a pattern in the DNA sequence
motifs = re.findall(r'TGA', DNA)

# Print the matches as a list
print(motifs)

['TGA', 'TGA']


If the pattern is not found, an empty list is returned.

In [5]:
# Example 4.5
# Name:re_findall.py
# Description:  This program demonstrates the use of regular expressions

# Import the regular expression module
import re

DNA = 'AGTTGTAATGAGGCTGCCGTGATAaaaaaa'

# A pattern that does not occur in the sequence gives an empty list

# Find a pattern in the DNA sequence
motifs = re.findall(r'SAM', DNA)

# Print the matches as a list
print(motifs)

[]


So, as when searching for keys in a dictionary, it is good practice to use an if else statement to check whether there is a match.

In [6]:
# Example 4.6
# Name:re_findall.py
# Description:  This program demonstrates the use of regular expressions

# Import the regular expression module
import re

DNA = 'AGTTGTAATGAGGCTGCCGTGATAaaaaaa'

# An empty list is falsy, so the result can be tested directly with if

# Find a pattern in the DNA sequence
motifs = re.findall(r'TGA', DNA)

if motifs :
    print (motifs)
else :
    print ("No match !!!")

['TGA', 'TGA']


As with lists or dictionaries, we can iterate over the result to print each item.

In [7]:
# Example 4.7
# Name:re_findall.py
# Description:  This program demonstrates the use of regular expressions

# Import the regular expression module
import re

DNA = 'AGTTGTAATGAGGCTGCCGTGATAaaaaaa'

# The result is a list, so it can be looped over with for

# Find a pattern in the DNA sequence
motifs = re.findall(r'TGA', DNA)

if motifs :
    for motif in motifs :
       print (motif)
else :
    print ("No match !!!")

TGA
TGA


Notice what happens below when we search for the pattern AAA using case-sensitive and case-insensitive approaches.

In [8]:
# Example 4.8
# Name:re_findall.py
# Description:  This program demonstrates the use of regular expressions

# Import the regular expression module
import re

DNA = 'AGTTGTAATGAGGCTGCCGTGATAaaaaaa'

# The first search is case-sensitive, so the lowercase a's cannot match

# Find a pattern in the DNA sequence
motifs = re.findall(r'AAA', DNA)

if motifs :
    print (motifs)
else :
    print ("No match !!!")


# Find a pattern in the DNA sequence
motifs = re.findall(r'AAA', DNA, re.I) # re.I is equivalent to re.IGNORECASE

if motifs :
    print (motifs)
else :
    print ("No match !!!")


No match !!!
['Aaa', 'aaa']


Notice that the second search for AAA found only 2 matches. The search for the second match begins after the last character of the first match, so matches never overlap. Finding overlapping patterns with a single regular expression is awkward, as most uses specifically don't want overlapping matches. The standard re module has no option for this; the third-party regex module accepts overlapped=True in findall and finditer, and with re a lookahead such as (?=AAA) can be used instead.
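
One way to get overlapping matches with the standard re module is a lookahead, written (?=...), which checks that the pattern follows without consuming any characters. The sketch below is one possible approach, not the only one; the variable names are assumptions.

```python
import re

DNA = 'AGTTGTAATGAGGCTGCCGTGATAaaaaaa'

# The lookahead matches an empty string wherever AAA follows,
# so the next search starts only one character further along
overlapping_starts = [m.start() for m in re.finditer(r'(?=AAA)', DNA, re.I)]
print(overlapping_starts)

# Putting a group inside the lookahead makes findall return the matched text
overlapping_motifs = re.findall(r'(?=(AAA))', DNA, re.I)
print(overlapping_motifs)
```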

### Finditer

findall returns a list of strings. If the pattern contains one or more groups, it returns a list of the groups instead (tuples when there is more than one group). finditer returns an iterator that yields match objects, which makes it helpful for finding the coordinates of each match.
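
The difference between the two return types can be sketched as follows. The pattern uses \w (any single word character, described under Basic Patterns) and parentheses to form a group; the variable names are assumptions for this example.

```python
import re

Protein = 'CTCMGDVEKGCTCKKIFIMKISQCGCHTVEKGGKHKTGPNLHGLFGRKTGCQAPGYSYTAANKNKGCTC'

# No group in the pattern: each item is the whole match
whole_matches = re.findall(r'C\wC', Protein)
print(whole_matches)

# One group in the pattern: each item is only the text captured by the group
middle_residues = re.findall(r'C(\w)C', Protein)
print(middle_residues)

# finditer yields match objects, so the whole match, the group
# and the coordinates are all available
for motif in re.finditer(r'C(\w)C', Protein):
    print(motif.group(), motif.group(1), motif.span())
```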

In [9]:
# Example 4.9
# Name:re_finditer.py
# Description:  This program demonstrates the use of regular expressions

import re
Protein = 'CTCMGDVEKGCTCKKIFIMKISQCGCHTVEKGGKHKTGPNLHGLFGRKTGCQAPGYSYTAANKNKGCTC'
pattern = r'CTC'

for motif in re.finditer(pattern, Protein) :
    print (motif.group())
    print (motif.span())
    print (motif.start())
if motif is None :
    print ("No match !!!")


CTC
(0, 3)
0
CTC
(10, 13)
10
CTC
(66, 69)
66


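A different test for the presence of motifs is needed with finditer than with findall. finditer returns an iterator object, which is always truthy even when there are no matches, so it cannot be tested with a plain if. The example below prints nothing, so it appears that the code isn't working. The loop itself is fine; the problem is the final test. When there are no matches the loop body never runs, so motif is never assigned by the loop. In a notebook, motif still holds the match from the previous cell, so the test is false and nothing is printed; run as a standalone script, the test would raise a NameError.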

In [10]:
# Example 4.10
# Name:re_finditer.py
# Description:  This program demonstrates the use of regular expressions
import re
Protein = 'CTCMGDVEKGCTCKKIFIMKISQCGCHTVEKGGKHKTGPNLHGLFGRKTGCQAPGYSYTAANKNKGCTC'
pattern = r'SKY'

for motif in re.finditer(pattern, Protein) :
    print (motif.group())
    print (motif.span())
    print (motif.start())
if motif is None :
    print ("No match !!!")

Set motif to None before the loop so that the test works when the pattern is absent.

In [11]:
# Example 4.11
# Name:re_finditer.py
# using None to test for presence of the pattern
import re
Protein = 'CTCMGDVEKGCTCKKIFIMKISQCGCHTVEKGGKHKTGPNLHGLFGRKTGCQAPGYSYTAANKNKGCTC'
pattern = r'SKY'

motif = None

for motif in re.finditer(pattern, Protein) :
    print (motif.group())
    print (motif.span())
    print (motif.start())
if motif is None :
    print ("No match !!!")

No match !!!
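
An alternative sketch, assuming the sequence is small enough to hold every match in memory: collect the match objects in a list, which can then be tested exactly like the result of findall. The name motifs is an assumption for this example.

```python
import re

Protein = 'CTCMGDVEKGCTCKKIFIMKISQCGCHTVEKGGKHKTGPNLHGLFGRKTGCQAPGYSYTAANKNKGCTC'
pattern = r'SKY'

# list() consumes the iterator; an empty list is falsy, unlike the iterator itself
motifs = list(re.finditer(pattern, Protein))

if motifs:
    for motif in motifs:
        print(motif.group(), motif.span())
else:
    print("No match !!!")
```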


### Basic Patterns

The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:

a, X, 9, < -- ordinary characters just match themselves exactly. The meta-characters . ^ $ * + ? { } [ ] \ | ( ) do not match themselves because they have special meanings (details below).

\ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure whether a punctuation character has special meaning, such as '@', you can put a backslash in front of it, \@, to make sure it is treated just as a character.

. (a period) -- matches any single character except newline '\n'

\w -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.

\b -- boundary between word and non-word

\s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.

\t, \n, \r -- tab, newline, return

\d -- decimal digit [0-9]

^ = start, $ = end -- match the start or end of the string

In [12]:
# Example 4.12
# Name:re_finditer.py
# Description:  This program demonstrates the use of regular expressions

import re
Protein = 'CTCMGDVEKGCTCKKIFIMKISQCGCHTVEKGGKHKTGPNLHGLFGRKTGCQAPGYSYTAANKNKGCTC'
pattern = r'C\wC'

# Start with None so the test after the loop works even when nothing matches
motif = None
for motif in re.finditer(pattern, Protein) :
    print (motif.group())
    print (motif.span())
    print (motif.start())
if motif is None :
    print ("No match !!!")

CTC
(0, 3)
0
CTC
(10, 13)
10
CGC
(23, 26)
23
CTC
(66, 69)
66
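
A minimal sketch of a few more of the single-character patterns listed above, run on a made-up annotation line. The strings and variable names are assumptions for illustration, not course data.

```python
import re

# Made-up annotation line (an assumption, for illustration only)
line = 'gene1\tchr2 1500'

digits = re.findall(r'\d', line)       # every decimal digit
spaces = re.findall(r'\s', line)       # every whitespace character (here a tab and a space)
at_start = re.search(r'^gene', line)   # ^ anchors the match at the start of the string
at_end = re.search(r'00$', line)       # $ anchors the match at the end of the string
boundary = re.search(r'\bchr', line)   # \b requires a word boundary just before 'chr'
period = re.search(r'\.', 'ver1.2')    # \. matches a literal period, not any character

print(digits)
print(spaces)
print(at_start.group(), at_end.group())
print(boundary.start(), period.span())
```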


### Repetition

Things get more interesting when you use +, * and ? to specify repetition in the pattern.

<pre>
+ match 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
* match 0 or more occurrences of the pattern to its left
? match 0 or 1 occurrences of the pattern to its left 
</pre>

First the search finds the leftmost match for the pattern, and second it tries to use up as much of the string as possible -- i.e. + and * go as far as possible (the + and * are said to be "greedy").

In [13]:
# Example 4.13
# Name:re_repetion.py
# Description:  This program demonstrates the use of regular expressions

# Import the regular expression module
import re

DNA = 'AGTTGTAATGAGGCTGCCGTGATAaaaaaa'

# Find a pattern in the DNA sequence
motifs = re.findall(r'AAA+', DNA, re.I)

for motif in motifs :
   print ('motif found using AAA+ : ', motif)
    
# Find a pattern in the DNA sequence
motifs = re.findall(r'AAA*', DNA, re.I)

for motif in motifs :
   print ('motif found using AAA* : ', motif)
    
# Find a pattern in the DNA sequence
motifs = re.findall(r'AAA?', DNA, re.I)

for motif in motifs :
   print ('motif found using AAA? : ', motif)

motif found using AAA+ :  Aaaaaaa
motif found using AAA* :  AA
motif found using AAA* :  Aaaaaaa
motif found using AAA? :  AA
motif found using AAA? :  Aaa
motif found using AAA? :  aaa


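The above results are a bit of a mind bender and take some time to think through. The key is that +, * and ? apply only to the last A in the pattern: AAA+ means AA followed by one or more A, AAA* means AA followed by zero or more A, and AAA? means AA followed by at most one more A.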
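
Printing the coordinates of each match can make the output of Example 4.13 easier to follow. This is a small sketch that reuses the same sequence and patterns with finditer; the dictionary name is an assumption.

```python
import re

DNA = 'AGTTGTAATGAGGCTGCCGTGATAaaaaaa'

# For each pattern, record where every match starts and ends
spans_by_pattern = {}
for pattern in (r'AAA+', r'AAA*', r'AAA?'):
    spans_by_pattern[pattern] = []
    for motif in re.finditer(pattern, DNA, re.I):
        spans_by_pattern[pattern].append(motif.span())
        print(pattern, motif.span(), motif.group())
```

The AA at positions 6 to 8 is followed by T, so it can only match the patterns that allow zero further A's.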

### Square brackets

Square brackets can be used to indicate a set of chars, so [acgt] matches 'a' or 'c' or 'g' or 't'.  You can also use a dash to indicate a range, so [a-z] matches all lowercase letters. To use a dash without indicating a range, put the dash last, e.g. [abc-]. An up-hat (^) at the start of a square-bracket set inverts it, so [^ab] means any char except 'a' or 'b'. 

In [14]:
# Example 4.14
# Name:re_square_backets.py
# Description:  This program demonstrates the use of square brackets in regular expressions

# Import the regular expression module
import re

DNA = 'AGTTGTAATGAGGCTGCCGTGATAaaaaaa'

# Find a pattern in the DNA sequence
motifs = re.findall(r'G[ACGT]G', DNA, re.I)

for motif in motifs :
   print ('motif found using G[ACGT]G : ', motif)

# Find a pattern in the DNA sequence
motifs = re.findall(r'A[^A]A', DNA, re.I)

for motif in motifs :
   print ('motif found using A[^A]A : ', motif)


motif found using G[ACGT]G :  GAG
motif found using G[ACGT]G :  GTG
motif found using A[^A]A :  ATA


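The vertical bar | inside a pattern means "or": the pattern matches if either alternative matches. Python's own or and and operators do not combine patterns, as the example below tests.

In [None]:
# Example 4.15
# Name:find_leucine_codons.py
# Description:  This program tests regular expressions to find leucine codons with and, or and |


# Import the regular expression module
import re

# RNA with all six leucine codons
RNA = 'CUACUGCUCCUUUUAUUG'

# Compile one pattern for each family of leucine codons
pattern1 = re.compile('CU[ACGU]', re.IGNORECASE)
pattern2 = re.compile('UU[AG]', re.IGNORECASE)

# Python evaluates "pattern1 or pattern2" first and gets pattern1,
# so only the CU[ACGU] codons are searched for
motifs = re.findall(pattern1 or pattern2, RNA)

for motif in motifs :
   print ('motif with or: ', motif)

# "pattern1 and pattern2" evaluates to pattern2,
# so only the UU[AG] codons are searched for
motifs = re.findall(pattern1 and pattern2, RNA)

for motif in motifs :
   print ('motif with and: ', motif)

# The | inside the pattern matches either alternative, so all six codons are found
motifs = re.findall(r'CU[ACGU]|UU[AG]', RNA)

for motif in motifs :
   print ('motif with | : ', motif)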
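
The behaviour of or and and in Example 4.15 can be checked directly. This sketch relies on the fact that compiled pattern objects are always truthy, so Python's or returns its first operand and and returns its second; the result variable names are assumptions.

```python
import re

RNA = 'CUACUGCUCCUUUUAUUG'
pattern1 = re.compile('CU[ACGU]', re.IGNORECASE)
pattern2 = re.compile('UU[AG]', re.IGNORECASE)

# Both comparisons are True: the operators pick one pattern, they do not merge them
or_picks_first = (pattern1 or pattern2) is pattern1
and_picks_second = (pattern1 and pattern2) is pattern2
print(or_picks_first, and_picks_second)

# Only | inside the pattern asks the regex engine for either alternative
all_leucine = re.findall(r'CU[ACGU]|UU[AG]', RNA)
print(all_leucine)
```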

In [15]:
# Example 4.16
# Name:find_protein.py
# Description:  This program finds putative proteins from a string of amino acids

# Import the regular expression module
import re

protein = 'MAALMRGLGDRV_QLQ_VVMGVINSQGAS_GYSHT_T'

# Algorithm 1
# This regular expression finds a protein that starts with M
# Then contains any word characters (\w also matches the underscore) followed by a stop
# It is greedy because it finds the longest string between M and the last _
pattern = r'M[\w]*_'
motifs = re.findall(pattern, protein)
if motifs :
    for motif in motifs :
       print ('A protein found by algorithm 1 is : ',  motif)
else :
    print ('No putative protein was found by algorithm 1')


# Algorithm 2
# This regular expression finds a protein that starts with M
# Then contains any letter followed by a stop
# The *? renders it a non greedy algorithm that finds the next stop
pattern = r'M[\w]*?_'
motifs = re.findall(pattern, protein)
if motifs :
    for motif in motifs :
       print ('A protein found by algorithm 2 is : ', motif)
else :
    print ('No putative protein was found by algorithm 2')

# Algorithm 3
# This regular expression finds a protein that starts with M
# Then contains any letter followed by a stop
# The *? renders it a non greedy algorithm that finds the next stop
# The [^M] means the string cannot contain M except at the start
pattern = r'M[^M]*?_'
motifs = re.findall(pattern, protein)

if motifs :
    for motif in motifs :
       print ('A protein found by algorithm 3 is : ', motif)
else :
    print ('No putative protein was found by algorithm 3')

A protein found by algorithm 1 is :  MAALMRGLGDRV_QLQ_VVMGVINSQGAS_GYSHT_
A protein found by algorithm 2 is :  MAALMRGLGDRV_
A protein found by algorithm 2 is :  MGVINSQGAS_
A protein found by algorithm 3 is :  MRGLGDRV_
A protein found by algorithm 3 is :  MGVINSQGAS_


Algorithm 1 is obviously wrong since the protein contains stop codons in the middle. In theory either algorithm 2 or 3 could lead to a correct protein. However, usually it is the longer protein, which includes the internal M, that is correct. Gene prediction programs include more information, such as the ribosomal binding site motif, to help distinguish between the proteins found by Algorithms 2 and 3.
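
To list every candidate at once (the longer proteins of Algorithm 2 and the shorter ones of Algorithm 3), one option is a lookahead that is tried at every M. This is only a sketch under the assumption that _ marks a stop, as in Example 4.16; the pattern and the name candidates are not part of the original example.

```python
import re

protein = 'MAALMRGLGDRV_QLQ_VVMGVINSQGAS_GYSHT_T'

# (?=...) consumes nothing, so a match is attempted at every M, including internal ones
# [^_]* takes every residue up to the next stop, so no candidate spans a stop
candidates = re.findall(r'(?=(M[^_]*_))', protein)

for candidate in candidates:
    print('A candidate protein is : ', candidate)
```

Choosing among the candidates still needs extra biological information of the kind mentioned above.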

Note: I left out the else statement that does something (e.g. print) when there is no match from many of the above examples, but it is good practice to leave it in.

## Exercises

1. Write a program to find the motifs with the pattern CGIC in the protein sequence  'MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLCCCCFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIPGTKMIFVGICGICKKKEERADLIAYLKKATNE'

2. Write a program to find the heme motifs with the pattern CxxC where x can be any amino acid in the protein sequence (use a match to any character)  'MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLCCCCFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIPGTKMIFVGICGICKKKEERADLIAYLKKATNE'

3. Modify the above program so that it also prints out the coordinates of each heme motif in the protein.

4. Search for possible stop codons in the DNA sequence 'ATGGTTGAAATGAGGCTAAGCCGTGATAG'

5. Using a list of the six DNA sequences and your code from Lab 3 Part 1 Exercise 2 search for possible stop codons in all sequences.

6. Using a dictionary of the six DNA sequences and your code from Lab 3 Part 2 Exercise 2 search for possible stop codons in all sequences.

7. Write a program for finding protein sequences in the DNA sequence below.  Use your code from Lab 3 Part 2 Exercise 3 that translates a DNA sequence in all 3 reading frames, then use Algorithm 2 in Example 4.16 to find the protein(s).

'TTACGATGATGCAGGAATCTGCGACAGAGACAATAAGCAACAGTTCAATGAATCAAAATGGAATGAGCACTCTAA
GCAGCCAATTAGATGCTGGCAGCAGAGATGGAAGATCAAGTGGTGACACCAGCTCTGAAGTAAGCACAGT
AGAACTGCTGCATCTGCAACAACAGCAGGCTCTCCAGGCAGCAAGACAACTTCTTTTACAGCAGCAAACA
AGTGGATTGAAATCTCCTAAGAGCAGTGATAAACAGAGACCACTGCAGGTGCCTGTGTCAGTGGCCATGA
TGACTCCCCAGGTGATCACCCCTCAGCAAATGCAGCAGATCCTTCAGCAACAAGTCCTGTCTCCTCAGCA
GCTACAAGCCCTTCTCCAACAACAGCAGGCTGTCATGCTGCAGCAGGATTTTTTGGATTCTGGATTGGAA
AATTTCAGAGCTGCCTTGGAAAAAAATCAACAACTACAAGAGTTTTACAAGAAACAGCAAGAGCAGTTAC
ATCTTCAGCTTTTGCAGCAGCAGCAGCAACAGCAGCAGCAGCAACAACAGCAGCAACAACAGCAGCAGCA
ACAACAACAACAACAGCAGCAACAACAGCAGCAGCAGCAGCAACAGCAGCAGCAGCAGCAACAGCATCCT
GGAAAGCAAGCGAAAGAGCAGCAGCAGCAGCAGCAGCAGCAACAGCAATTGGCAGCCCAGCAGCTTGTCT
TCCAGCAGCAGCTTCTCCAGATGCAACAACTCCAGCAGCAGCAGCATCTGCTCAGCCTTCAGCGTCAGGG
ACTCATCTCCATTCCACCTGGCCAGGCAGCACTTCCTGTCCAATCGCTGCCTCAAGCTGGCTTAAGTCCT
GCTGAGATTCAGCAGTTATGGAAAGAAGTGACTGGAGTTCACAGTATGGAAGACAATGGCATTAAACATG
GAGGGCTAGACCTCACTACTAACAATTCCTCCTCGACTACCTCCTCCAACACTTCCAAAGCATCACCACC
AATAACTCATCATTCCATAGTGAATGGACAGTCTTCAGTTCTAAGTGCAAGACGAGACAGCTCGTCACAT
GAGGAGACTGGGGCCTCTCACACTCTCTATGGCCATGGAGTTTGCAAATGGCCAGGCTGTGAAAGCATTT
GTGAAGATTTTGGACAGTTTTTAAAGCACCTTAACAATGAACACGCATTGGATGACCGAAGCACTGCTCA
GTGTCGAGTGCAAATGCAGGTGGTGCAACAGTTAGAAATACAGCTTTCTAAAGAACGCGAACGTCTTCAA
GCAATGATGACCCACTTGCACATGCGACCCTCAGAGCCCAAACCATCTCCCAAACCTCTAAATCTGGTGT
CTAGTGTCACCATGTCGAAGAATATGTTGGAGACATCCCCACAGAGCTTACCTCAAACCCCTACCACACC
AACGGCCCCAGTCACCCCGATTACCCAGGGACCCTCAGTAATCACCCCAGCCAGTGTGCCCAATGTGGGA
GCCATACGAAGGCGACATTCAGACAAATACAACATTCCCATGTCATCAGAAATTGCCCCAAACTATGAAT
TTTATAAAAATGCAGATGTCAGACCTCCATTTACTTATGCAACTCTCATAAGGCAGGCTATCATGGAGTC
ATCTGACAGGCAGTTAACACTTAATGAAATTTACAGCTGGTTTACACGGACATTTGCTTACTTCAGGCGT
AATGCAGCAACTTGGAAGAATGCAGTACGTCATAATCTTAGCCTGCACAAGTGTTTTGTTCGAGTAGAAA
ATGTTAAAGGAGCAGTATGGACTGTGGATGAAGTAGAATACCAGAAGCGAAGGTCACAAAAGATAACAGG
AAGTCCAACCTTAGTAAAAAATATACCTACCAGTTTAGGCTATGGAGCAGCTCTTAATGCCAGTTTGCAG
GCTGCCTTGGCAGAGAGCAGTTTACCTTTGCTAAGTAATCCTGGACTGATAAATAATGCATCCAGTGGCC
TACTGCAGGCCGTCCACGAAGACCTCAATGGTTCTCTGGATCACATTGACAGCAATGGAAACAGTAGTCC
GGGCTGCTCACCTCAGCCGCACATACATTCAATCCACGTCAAGGAAGAGCCAGTGATTGCAGAGGATGAA
GACTGCCCAATGTCCTTAGTGACAACAGCTAATCACAGTCCAGAATTAGAAGACGACAGAGAGATTGAAG
AAGAGCCTTTATCTGAAGATCTGGAATGAAATTATGTTATTATATTGAA'
