# Locating Restriction Sites

## The Billion-Year War

The war between viruses and bacteria has been waged for over a billion years. Viruses called bacteriophages (or simply phages) require a bacterial host to propagate, and so they must somehow infiltrate the bacterium; such deception can only be achieved if the phage understands the genetic framework underlying the bacterium's cellular functions. The phage's goal is to insert DNA that will be replicated within the bacterium and lead to the reproduction of as many copies of the phage as possible, which sometimes also involves the bacterium's demise.
To defend itself, the bacterium must either obfuscate its cellular functions so that the phage cannot infiltrate it, or better yet, go on the counterattack by calling in the air force. Specifically, the bacterium employs aerial scouts called restriction enzymes, which operate by cutting through viral DNA to cripple the phage. But what kind of DNA are restriction enzymes looking for?

The restriction enzyme is a homodimer, which means that it is composed of two identical substructures. Each of these structures separates from the restriction enzyme in order to bind to and cut one strand of the phage DNA molecule; both substructures are pre-programmed with the same target string containing 4 to 12 nucleotides to search for within the phage DNA. The chance that both strands of phage DNA will be cut (thus crippling the phage) is greater if the target is located on both strands of phage DNA, as close to each other as possible. By extension, the best chance of disarming the phage occurs when the two target copies appear directly across from each other along the phage DNA, a phenomenon that occurs precisely when the target is equal to its own reverse complement. Eons of evolution have made sure that most restriction enzyme targets now have this form.

You may be curious how the bacterium prevents its own DNA from being cut by restriction enzymes. The short answer is that it protects its own DNA through a chemical process called [DNA methylation](http://rosalind.info/glossary/dna-methylation/).

## Problem

A DNA string is a reverse palindrome if it is equal to its reverse complement. For instance, GCATGC is a reverse palindrome because its reverse complement is GCATGC.
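
As a quick illustration of this definition, here is a minimal sketch that tests whether a string equals its own reverse complement. The helper names `reverse_complement` and `is_reverse_palindrome` are chosen here for illustration only and are not part of the problem statement; the sketch assumes uppercase input over the alphabet A, C, G, T.

```python
# Translation table mapping each base to its complement
# (assumes uppercase A/C/G/T input only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def is_reverse_palindrome(dna: str) -> bool:
    """Return True if the string equals its own reverse complement."""
    return dna == reverse_complement(dna)


print(is_reverse_palindrome("GCATGC"))  # True
print(is_reverse_palindrome("GCATGA"))  # False
```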

### Given: 
A DNA string of length at most 1 kbp in FASTA format.

### Return: 
The position and length of every reverse palindrome in the string having length between 4 and 12. You may return these pairs in any order.

### Sample Dataset
```
>Rosalind_24
TCAATGCATGCGGGTCTATATGCAT
```

### Sample Output
```
4 6
5 4
6 6
7 4
17 4
18 4
20 6
21 4
```

## Solution 1 - Check everything

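A reverse palindrome reads the same on both strands. For example, `ATGCAT` pairs with `TACGTA` on the complementary strand, and reading that complement from right to left gives `ATGCAT` again.

Enumerate every k-mer of 4 to 12 bases and compare it with the reversed k-mer at the same position on the complementary strand.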

In [1]:
import Bio.Seq as bio
from Bio import SeqIO

In [2]:
# Compare k-mers on both strands and print the reverse palindromes
def palindromeSol1(forward, reverse):
    # forward: the DNA string; reverse: its complement (not reversed)
    # Loop through every start position in the forward sequence
    for m in range(len(forward)):
        # Loop through every end position from m onward
        for n in range(m, len(forward)):
            kmer = forward[m:n + 1]
            # Get the matching k-mer on the complementary strand
            reverse_k = reverse[m:n + 1]
            # Skip k-mers that are too short or too long
            if 4 <= len(kmer) <= 12:
                # Is the k-mer a reverse palindrome?
                if kmer == reverse_k[::-1]:
                    # Report the 1-based position and the length
                    print(m + 1, len(kmer))

In [3]:
sStr = "TCAATGCATGCGGGTCTATATGCAT"
sSeq = bio.Seq(sStr)
print(sStr)
sCom = sSeq.complement() 
print(sCom)

TCAATGCATGCGGGTCTATATGCAT
AGTTACGTACGCCCAGATATACGTA


In [4]:
print(sStr)
print(sCom[::-1])

TCAATGCATGCGGGTCTATATGCAT
ATGCATATAGACCCGCATGCATTGA


In [5]:
palindromeSol1(sStr, sCom)

4 6
5 4
6 6
7 4
17 4
18 4
20 6
21 4


In [7]:
# Rosalind Data
nucs = SeqIO.read('rosalind_revp.txt', 'fasta')
frw = str(nucs.seq)                       
rev = str(nucs.seq.complement())

palindromeSol1(frw, rev)

1 4
2 4
26 4
50 4
54 4
64 4
80 4
87 4
95 4
116 4
121 4
126 4
202 6
203 4
224 4
230 4
241 4
241 6
242 4
242 6
243 4
244 4
248 6
249 4
254 4
268 8
269 6
270 4
286 4
290 4
296 4
301 4
305 4
332 6
333 4
379 4
410 4
425 12
426 10
427 4
427 8
428 6
429 4
431 4
448 4
453 4
464 6
465 4
482 4
512 4
513 6
514 4
519 4
524 4
527 4
548 4
568 4
581 4
590 4
608 4
613 4
615 4
618 4
618 12
619 10
620 8
621 4
621 6
622 4
623 4
626 4
652 6
653 4
691 4
716 4
735 6
736 4
745 12
746 10
747 4
747 8
748 6
749 4
751 4
761 4
775 6
776 4
780 4
793 4
807 4
816 4
824 4
846 4
857 4


# Rosalind Solution

## Principle:

This problem builds on the reverse complement problem, except that you only need the reverse complement of a half site and then check whether it immediately follows that half site. So, for sites of a given even length:

* go through the sequence from beginning to end, taking each subsequence of half this length consecutively
* compute the reverse complement of the half site
* use `str.find` to check whether the next half-length bases contain the reverse complement of the half site
* if so (`find` does not return -1), add the position to the list of "restriction sites"
* add 1 to the position to get conventional base counting starting at 1 rather than Python's 0.
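
The steps above can be sketched roughly as follows. This is one possible reading of the half-site approach, not the reference implementation: the function name `find_restriction_sites` and its parameters are assumptions made for illustration, the search window is limited to exactly the half-length bases that follow each half site, and `min_len` is assumed to be even.

```python
# Translation table mapping each base to its complement
# (assumes uppercase A/C/G/T input only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_restriction_sites(dna: str, min_len: int = 4, max_len: int = 12):
    """Return sorted (1-based position, length) pairs of reverse palindromes.

    Only even lengths are checked, so min_len is assumed to be even.
    """
    sites = []
    for length in range(min_len, max_len + 1, 2):
        half = length // 2
        for start in range(len(dna) - length + 1):
            half_site = dna[start:start + half]
            # Window of exactly `half` bases that follow the half site
            following = dna[start + half:start + length]
            # find() returns -1 when the reverse complement is absent
            if following.find(reverse_complement(half_site)) != -1:
                # Add 1 to convert Python's 0-based index to base counting
                sites.append((start + 1, length))
    return sorted(sites)


for position, length in find_restriction_sites("TCAATGCATGCGGGTCTATATGCAT"):
    print(position, length)
```

Run on the sample dataset, this prints the same pairs as the sample output. Because the window has the same length as the half site, `find` can only succeed at offset 0, so the test is equivalent to a direct string comparison.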

## Any advanced information

This method only finds "restriction sites" that are complete reverse palindromes and thus consist of an even number of bases. In reality, 5-base sites also occur; the middle base is then often either A/G (IUPAC code R) or C/T (IUPAC code Y), so such an enzyme can cut several possible sequences.
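
To illustrate how such a degenerate site could be located, the sketch below expands IUPAC codes into regular-expression character classes and scans one strand. Everything here is an assumption for illustration: the function name `find_degenerate_sites` is invented, only a subset of IUPAC codes is mapped, and the pattern `GARTC` is a made-up example rather than the recognition site of a specific enzyme.

```python
import re

# Subset of IUPAC nucleotide codes mapped to regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]",    # purine
    "Y": "[CT]",    # pyrimidine
    "N": "[ACGT]",  # any base
}


def find_degenerate_sites(dna: str, pattern: str):
    """Return 1-based start positions where the IUPAC pattern matches.

    A lookahead is used so that overlapping matches are reported too.
    Only the given strand is searched.
    """
    regex = re.compile("(?=" + "".join(IUPAC[code] for code in pattern) + ")")
    return [match.start() + 1 for match in regex.finditer(dna)]


# "GARTC" is a hypothetical 5-base site used only for illustration
print(find_degenerate_sites("TTGAATCGGAGTCA", "GARTC"))  # [3, 9]
```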