### Assignment three -- more flexible matches

In this assignment we're going to try out some approaches to find inexact or fuzzy sequence matches. 
We'll use the built-in regular expression library to find some restriction enzyme sites in various 
DNA fragments. 

Regular expressions are complex! We'll just use a small fraction of their power, but the Python documentation
goes into a great deal more depth on the various features they support:

- [Python regex (re) docs](https://docs.python.org/3/library/re.html)

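One word of caution: regular expressions have complicated implementations, with a number of performance 'gotchas'. Python's re module uses a backtracking matcher, so some patterns (nested repetition is the usual culprit) can become very slow on certain inputs.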

In [3]:
# import the regular expression (re) library
import re

### Basic example of finding a match

There are many ways to use regular expressions. Python allows you to 'compile' a regular expression before you use it. Building the regular expression's internal matcher takes time, so compiling once and reusing the result saves time when the same pattern is searched repeatedly.

In [9]:
nucleotides = 'ACGTTTTAAGACAGATTA'
pattern = re.compile("TTTTA")
match = pattern.search(nucleotides)

if match:
    print('found', match.group())
else:
    print('did not find')

found TTTTA
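
The benefit of compiling shows up when one pattern is applied to many sequences. Here is a minimal sketch of that reuse; the fragment names and the second and third sequences are invented for illustration, and only the first position found in each fragment is recorded.

```python
import re

# Compile once, then reuse the same pattern object for every search.
pattern = re.compile("TTTTA")

# Hypothetical fragments, made up for this example.
fragments = {
    "fragment_1": "ACGTTTTAAGACAGATTA",
    "fragment_2": "GGGCCCATAT",
    "fragment_3": "TTTTATTTTA",
}

# Store the start of the first hit per fragment, or None when there is no hit.
first_hits = {}
for name, sequence in fragments.items():
    match = pattern.search(sequence)
    first_hits[name] = match.start() if match else None

print(first_hits)
```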


### Degenerate bases

Now let's start allowing sets of characters, in this case any DNA base, in our regular expression. Let's look for the same pattern, except with two Ns at the end:

TTTTANN

Which we convert into the regular expression pattern:

TTTTA[ACGT][ACGT]

In [12]:
nucleotides = 'ACGTTTTAAGACAGATTA'
pattern = re.compile("TTTTA[ACGT][ACGT]")
match = pattern.search(nucleotides)

if match:
    print('found', match.group())
else:
    print('did not find')

found TTTTAAG
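
A character set does not have to list all four bases. As a small sketch, the standard codes R (A or G) and Y (C or T) become two-letter sets; the patterns TTTTAR and TTTTAY are made up here to show one set that matches and one that does not.

```python
import re

nucleotides = 'ACGTTTTAAGACAGATTA'

# TTTTAR: R stands for a purine, A or G.
purine_pattern = re.compile("TTTTA[AG]")
# TTTTAY: Y stands for a pyrimidine, C or T.
pyrimidine_pattern = re.compile("TTTTA[CT]")

purine_match = purine_pattern.search(nucleotides)
pyrimidine_match = pyrimidine_pattern.search(nucleotides)

# The only TTTTA in this sequence is followed by an A, so only the first pattern hits.
print(purine_match.group() if purine_match else 'did not find')
print(pyrimidine_match.group() if pyrimidine_match else 'did not find')
```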


### Matches and positions

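Often we want to find all the matches and their positions in a target sequence. The finditer method returns each non-overlapping match in turn, and match.start() gives its zero-based position.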

In [15]:
pattern_CG = re.compile("CG")
nucleotides = 'ACGTCGTAAGACGCGATTA'

for match in pattern_CG.finditer(nucleotides):
    print(match.start(), match.group())

1 CG
4 CG
11 CG
13 CG
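
One thing to watch for: finditer skips matches that overlap an earlier match. A common workaround, sketched here on an invented sequence, is to wrap the pattern in a lookahead `(?=(...))` so that each match consumes no characters and the captured group holds the site.

```python
import re

nucleotides = 'CAAAAG'  # made-up sequence with a run of four As

# Plain pattern: after matching AA at position 1, the search resumes at position 3.
plain = re.compile("AA")
plain_starts = [match.start() for match in plain.finditer(nucleotides)]

# Lookahead pattern: the match itself is empty, group(1) holds the AA that follows.
overlapping = re.compile("(?=(AA))")
overlapping_starts = [match.start() for match in overlapping.finditer(nucleotides)]

print(plain_starts)
print(overlapping_starts)
```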


### Repetition qualifiers

Regex also supports a number of repetition qualifiers: ways to repeat a pattern zero or more times, matching as many as possible (the * operator, called greedy matching), as few as possible (non-greedy, written by adding ? after the qualifier), or a set number or range of times (curly brackets {}).

In [18]:
pattern_CG = re.compile("C*G") # zero or more Cs (as many as possible), followed by a G
nucleotides = 'ACCCCCGTCGTAAGACGCGATTA'

for match in pattern_CG.finditer(nucleotides):
    print(match.start(), match.group())

1 CCCCCG
8 CG
13 G
15 CG
17 CG


In [20]:
pattern_CG = re.compile("C{3}G") # exactly three Cs followed by a G
nucleotides = 'ACCCCCGTCGTAAGACGCGATTA'

for match in pattern_CG.finditer(nucleotides):
    print(match.start(), match.group())

3 CCCG
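
The non-greedy form and the range form are not exercised in the cells above, so here is a short sketch on the same sequence. The patterns `AC+`, `AC+?` and `C{2,4}G` are chosen only for illustration; `+` means one or more.

```python
import re

nucleotides = 'ACCCCCGTCGTAAGACGCGATTA'

# Greedy: an A followed by as many Cs as possible.
greedy_matches = re.findall("AC+", nucleotides)

# Non-greedy: an A followed by as few Cs as possible (still at least one).
non_greedy_matches = re.findall("AC+?", nucleotides)

# Range: between two and four Cs, followed by a G.
range_matches = [(match.start(), match.group())
                 for match in re.finditer("C{2,4}G", nucleotides)]

print(greedy_matches)
print(non_greedy_matches)
print(range_matches)
```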


### Read in a list of restriction sites

There are many, many restriction enzymes out there (see the list starting with 'A' [here](https://en.wikipedia.org/wiki/List_of_restriction_enzyme_cutting_sites:_A)). I pulled a shorter
list from [Promega](https://www.promega.com/resources/guides/nucleic-acid-analysis/restriction-enzyme-resource/restriction-enzyme-resource-tables/), 
which has a relatively small collection. Here we can read in each restriction enzyme with its recognition sequence:

In [23]:
with open("restriction_enzymes_promega.txt") as promega_file:
    header = promega_file.readline()
    for restriction_line in promega_file:
        line_split_into_tokens = restriction_line.strip().split("\t")
        print(line_split_into_tokens[0] + " with sequence " + line_split_into_tokens[1])

AatII with sequence GACGTC
AccB7I with sequence CCANNNNNTGG
AccIII with sequence TCCGGA
Acc65I with sequence GGTACC
ApaI with sequence GGGCCC
AvaI with sequence CYCGRG
AvaII with sequence GGWCC
Bal I with sequence TGGCCA
BamHI with sequence GGATCC
BanII with sequence GRGCYC
BbuI with sequence GCATGC
Bcl I with sequence TGATCA
BglI with sequence GCCNNNNNGGC
BssHII with sequence GCGCGC
BglII with sequence AGATCT
BsaOI with sequence CGRYCG
Bsp1286 I with sequence GDGCHC
BsrBRI with sequence GATNN
BstEII with sequence GGTNACC
BstOI with sequence CCWGG
BstXI with sequence CCANNNNNNTGG
BstZI with sequence CGGCCG
CfoI with sequence GCGC
ClaI with sequence ATCGAT
CspI with sequence CGGWCCG
Csp45I with sequence TTCGAA
DdeI with sequence CTNAG
Eco47III with sequence AGCGCT
Eco52I with sequence CGGCCG
EcoRI with sequence GAATTC
FokI with sequence GGATG
HaeIII with sequence GGCC
HhaI with sequence GCGC
HincII with sequence GTYRAC
HindIII with sequence AAGCTT
HpaII with sequence CCGG
KpnI with sequ
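
Printing is fine for a quick look, but for later lookups it may be handier to keep the table in a dictionary. This is a sketch under assumptions: an in-memory stand-in replaces restriction_enzymes_promega.txt so the example runs on its own, the header text is a guess, and the three rows are copied from the output above.

```python
import io

# Stand-in for the real file: tab-separated, one header line (header text is assumed).
example_file = io.StringIO(
    "enzyme\tsequence\n"
    "AatII\tGACGTC\n"
    "BamHI\tGGATCC\n"
    "EcoRI\tGAATTC\n"
)

recognition_sites = {}
header = example_file.readline()  # skip the header line
for restriction_line in example_file:
    tokens = restriction_line.strip().split("\t")
    recognition_sites[tokens[0]] = tokens[1]  # enzyme name -> recognition sequence

print(recognition_sites["EcoRI"])
```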

### Homework time (6 points)

The assignment is just one function, but it's a bit more complex than functions we've designed before. You saw above that we can compile a regular expression and store it in a variable, then use that variable to perform subsequent matches. Here we're going to create a function that, given a degenerate pattern, creates a compiled regex that matches all possible sequences described by that pattern. For instance:
    
XmnI recognizes the pattern GAANNNNTTC. The returned compiled regex should match GAA __TTAA__ TTC or GAA __GGGA__ TTC.

You should support all of the standard FASTA codes that apply to DNA (A,C,G,T,R,Y,K,M,S,W,B,D,H,V,N) from the [Wikipedia page](https://en.wikipedia.org/wiki/FASTA_format).

In [14]:
# Completing this function is your homework
def restriction_regex_generator(sequence_pattern):
    
    return(compiled_regex) # you can call the variable whatever you'd like
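
One way to check a finished function is to list sequences that should and should not match. The sketch below is only a suggestion: check_regex is a hypothetical helper, and it is exercised with a hand-written regex for XmnI (GAANNNNTTC) standing in for whatever your function returns.

```python
import re

def check_regex(compiled_regex, should_match, should_not_match):
    """Return True when every sequence behaves as expected (hypothetical helper)."""
    for sequence in should_match:
        if compiled_regex.search(sequence) is None:
            return False
    for sequence in should_not_match:
        if compiled_regex.search(sequence) is not None:
            return False
    return True

# Hand-written stand-in for the regex your function should produce for GAANNNNTTC.
xmni_regex = re.compile("GAA[ACGT][ACGT][ACGT][ACGT]TTC")

result = check_regex(
    xmni_regex,
    should_match=["GAATTAATTC", "GAAGGGATTC"],
    should_not_match=["GAATTC", "GAATTAATTG"],
)
print(result)
```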

### An example

Let's pretend there's an alien world where the bases are coded differently:

Normal bases:
- __P__ pairs with __T__
- __K__ pairs with __L__

Degenerate codes:
- __M__ codes for either __P__ or __T__
- __N__ codes for either __K__ or __L__
- __X__ codes for any base

Like above, to help a friend out you've been asked to write a function that takes the degenerate coding, for instance the degenerate sequence __PM__ which could be either __PP__ or __PT__, and creates a regular expression that finds that sequence somewhere in a genome. Here's something you put together:

In [12]:
def alien_restriction_regex_generator(sequence_pattern):
    resulting_alien_regex_string = ''

    for alien_base in sequence_pattern:
        if alien_base in 'PTKL':
            # normal bases match themselves
            resulting_alien_regex_string += alien_base
        elif alien_base == 'M':
            resulting_alien_regex_string += '[PT]'
        elif alien_base == 'N':
            resulting_alien_regex_string += '[KL]'
        elif alien_base == 'X':
            resulting_alien_regex_string += '[PTKL]'
        else:
            # fail loudly instead of silently dropping an unknown code
            raise ValueError('unknown alien base: ' + alien_base)

    compiled_regex = re.compile(resulting_alien_regex_string)

    return compiled_regex
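
The same mapping can also be written as a dictionary lookup instead of a chain of if statements. This is an alternative sketch, not a replacement for the function above; the names ALIEN_CODE_TO_REGEX and alien_regex_from_table are made up for this example.

```python
import re

# Lookup table from each alien code to its regex fragment.
ALIEN_CODE_TO_REGEX = {
    'P': 'P', 'T': 'T', 'K': 'K', 'L': 'L',
    'M': '[PT]',
    'N': '[KL]',
    'X': '[PTKL]',
}

def alien_regex_from_table(sequence_pattern):
    # A KeyError here signals a character that is not a known alien code.
    fragments = [ALIEN_CODE_TO_REGEX[alien_base] for alien_base in sequence_pattern]
    return re.compile(''.join(fragments))

print(alien_regex_from_table("PTMN").pattern)
```

With a table, supporting a new code means adding one dictionary entry rather than another branch.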

### Testing the alien function

Now that you've written your function, you want to test it in a number of different ways:

In [13]:
alien_regex_for_PTMN = alien_restriction_regex_generator("PTMN")

# show what the resulting pattern looks like
print(alien_regex_for_PTMN)

# this should result in a match
match = alien_regex_for_PTMN.search("PTTK")
if match:
    print('found', match.group())
else:
    print('did not find string matching the pattern')

# this should also result in a match
match = alien_regex_for_PTMN.search("PTTL")
if match:
    print('found', match.group())
else:
    print('did not find string matching the pattern')

# this should NOT match
match = alien_regex_for_PTMN.search("PTKT")
if match:
    print('found', match.group())
else:
    print('did not find string matching the pattern')

re.compile('PT[PT][KL]')
found PTTK
found PTTL
did not find string matching the pattern
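
Reading printed output works for three cases, but a table of cases with assert statements scales better. A minimal sketch follows, assuming the pattern shown in the output above is compiled directly so the example runs on its own; the last test sequence is invented to show that search also finds the site inside a longer sequence.

```python
import re

# The pattern printed above for the degenerate sequence PTMN.
alien_regex_for_PTMN = re.compile('PT[PT][KL]')

# Each entry is (sequence, whether a match is expected).
test_cases = [
    ("PTTK", True),
    ("PTTL", True),
    ("PTKT", False),
    ("KKPTPLKK", True),  # the site sits in the middle of a longer sequence
]

for sequence, expected in test_cases:
    found = alien_regex_for_PTMN.search(sequence) is not None
    # assert stops with an error naming the sequence that misbehaved
    assert found == expected, sequence

print('all', len(test_cases), 'cases behaved as expected')
```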
