# Restriction Enzymes
There is an enzyme database, [REBASE](http://rebase.neb.com/rebase/rebase.html), that describes recognition sites in a special notation (also adopted by [IUPAC](https://iupac.org/)):

<h2 align="center"><code>G^AATTC</code></h2>

Here the `^` marks the position in the recognition sequence where the cut is made. Below are some useful functions for handling this notation.

In [1]:
from re import finditer

In [2]:
enz = "G^AANTC"
dna1 = "ATGAAAGAAGTCTTATGAATGAGCCTCAGCTGAAGAANTCCATCGCGCAGAANTCCTACGCTCAGACTCAGACTCAGCATTATAGTGAATTCTTAATAAATAAAATAA"

In [3]:
def rebase_to_regex(rebase):
    """Convert a REBASE site such as "G^AATTC" into a (regex, cut offset) tuple"""
    # Ambiguity codes become regex character classes; "^" is dropped from the pattern
    dic = {"U": "T", "R": "[AG]", "Y":"[CT]", "S":"[CG]", "W":"[AT]", "K":"[GT]", "M":"[AC]", "B":"[CGT]", "D":"[AGT]", "H":"[ACT]", "V":"[ACG]", "N":"[ACGT]", "-":".", "^":""}
    return "".join(dic.get(x, x) for x in rebase), rebase.index("^")

In [4]:
reg = rebase_to_regex(enz)
print(reg)
assert reg == ("GAA[ACGT]TC", 1), "There seems to be an error with the function"

('GAA[ACGT]TC', 1)


In [5]:
def cut_positions(dna, enz):
    """Given a DNA sequence and a REBASE enzyme, yield the positions in the DNA where the enzyme cuts"""
    reg, off = rebase_to_regex(enz)
    for x in finditer(reg, dna):
        yield x.start() + off  # match start plus the offset of "^" within the site

In [6]:
positions = list(cut_positions(dna1, enz))
print(positions)
assert positions == [7, 87], "There seems to be an error with the function"

[7, 87]
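
One caveat about the approach above: `re.finditer` only reports non-overlapping matches, so a site that starts inside the previous match is skipped. The sketch below shows one possible workaround using a zero-width lookahead. The helper name `overlapping_cut_positions`, the toy sequence, and the toy site are all made up for illustration and do not correspond to a real enzyme.

```python
from re import finditer

def overlapping_cut_positions(dna, pattern, offset):
    """Hypothetical helper: like cut_positions, but also reports overlapping sites."""
    # A zero-width lookahead matches at a start position without consuming
    # characters, so the scan can also find sites that overlap each other.
    for m in finditer(f"(?=(?:{pattern}))", dna):
        yield m.start() + offset

toy_dna = "ATATA"                   # made-up sequence
toy_pattern, toy_offset = "ATA", 1  # made-up site, equivalent to "A^TA"

plain = [m.start() + toy_offset for m in finditer(toy_pattern, toy_dna)]
overlapping = list(overlapping_cut_positions(toy_dna, toy_pattern, toy_offset))
print(plain)        # [1]
print(overlapping)  # [1, 3]
```

Whether overlapping sites should count depends on the question being asked, so this is an option to consider and not a replacement for the function above.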


In [7]:
def cut_subsequences(dna, enz):
    """Given a dna sequence and a restriction enzyme, return the resulting cut subsequences"""
    last = 0
    for p in cut_positions(dna, enz):
        yield dna[last:p]
        last = p
    yield dna[last:]

In [8]:
subseq = list(cut_subsequences(dna1, enz))
print(subseq)
assert subseq == ['ATGAAAG', 'AAGTCTTATGAATGAGCCTCAGCTGAAGAANTCCATCGCGCAGAANTCCTACGCTCAGACTCAGACTCAGCATTATAGTG', 'AATTCTTAATAAATAAAATAA'], "There seems to be an error with the function"

['ATGAAAG', 'AAGTCTTATGAATGAGCCTCAGCTGAAGAANTCCATCGCGCAGAANTCCTACGCTCAGACTCAGACTCAGCATTATAGTG', 'AATTCTTAATAAATAAAATAA']


# Out of RegEx and into the frying counters

In [19]:
from collections import defaultdict

In [40]:
def find_repeated_subsequences_len_k(seq, k, top=None):
    """Count every length-k subsequence of seq and return (subsequence, count) pairs, most frequent first"""
    freq = defaultdict(int)
    for i in range(0, len(seq) - k + 1):
        freq[seq[i:i+k]] += 1
    return sorted(freq.items(), key=lambda x: x[1], reverse=True)[:top if top else len(freq)]

In [43]:
res = find_repeated_subsequences_len_k("ABCABCABCDFG", 3)
print(res)
assert res == [('ABC', 3), ('BCA', 2), ('CAB', 2), ('BCD', 1), ('CDF', 1), ('DFG', 1)], "There seems to be an error with the function"

[('ABC', 3), ('BCA', 2), ('CAB', 2), ('BCD', 1), ('CDF', 1), ('DFG', 1)]
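
As a possible alternative, the same counting can be sketched with `collections.Counter`, whose `most_common` method handles both the sorting and the `top` cut-off. The name `top_kmers` is hypothetical and not part of the notebook above.

```python
from collections import Counter

def top_kmers(seq, k, top=None):
    """Hypothetical alternative to find_repeated_subsequences_len_k using Counter."""
    # Slide a window of width k over seq and count each subsequence
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # most_common(None) returns every (subsequence, count) pair, most frequent first
    return counts.most_common(top)

print(top_kmers("ABCABCABCDFG", 3, top=1))  # [('ABC', 3)]
```

One behavioral difference: `top=0` yields an empty list here, whereas the function above treats `0` like `None` and returns everything.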
