# Finding CRISPR arrays
## Specification
## Program Name: randomizedMotifSearch.py

## Input/Output: STDIN, STDOUT

## Options:

- -i iterations (int)
- -k motif length (int)
- -p pseudocount (float)

In [None]:
Ex: python randomizedMotifSearch.py -i 1000 -k 13 -p 1 < input.fa > output.out

Inspection Notebook example:
main ( "input.fa", options = [ "-i 1000", "-k 13", "-p 1"] )

## Background
CRISPR arrays exist in most archaeal species and many bacterial species. The arrays are encoded in chromosomal DNA as direct repeats (DR) that flank sequences (spacers), which provide immunity against invading plasmids, viral sequences, and other foreign DNA and RNA. These arrays can be short, comprising 1-2 spacers, or they can contain hundreds of spacer elements, each separated by DR sequences. The promoter element seems to lie upstream of the array. It initiates transcription of an RNA that is then cleaved, and the resulting spacers are loaded into protein machines that specifically target the matching invader should it be found. In known cases, that targeting causes cleavage of the foreign DNA.

There are three main classes of CRISPR systems: types I, II, and III. They are defined by the associated protein machines, or more precisely by the specific genes that encode them. In the archaea, only types I and III have been found so far, while all three types have been seen in bacteria. Type II exists in a few species, including _Streptococcus pyogenes_. The specific gene used to classify this system is named Cas9.

We see cases where Cas genes, and sometimes CRISPR arrays, exist in a viral sequence. This, of course, opens many questions about the role of the CRISPR system with respect to viruses - why would a virus carry an immune defense that targets viral sequence? The existence of these arrays suggests that the CRISPR system itself might be mobile. How does the system get established, and how does the surrounding transcription machinery get encoded on the chromosome?

For this assignment, we will work to find the promoter motif. We would expect that this promoter motif and an associated B recognition element (BRE) would be present in archaeal species. What is that motif, and where is it located relative to transcription start?

We know little about the sequence specifics of the promoter element, though it seems to be about 13 nucleotides in length. In Pyrobaculum species, genes are packed quite tightly, so we can look upstream of the array by about 50 bases to see if a common motif sequence might exist. We need to assume that some mutations can be tolerated in that sequence.

## Null model
In prior years, we included an option that scrambled the input sequences before running the search for a best score. Scrambling effectively eliminates any signal that might be present in the data while preserving the overall composition of the sequences, so the only information left in a scrambled set is its base-level composition. We would then compare the best score achieved without scrambling to the score achieved with scrambling. We now use relative entropy to make this comparison between our experimental model (P) and our null model (Q):

$Q(s)=\Pr(s)=\frac{count(s)}{N}, \quad s\in\lbrace A,C,G,T \rbrace$

We can then use relative entropy to find a score relative to the base level composition by:

$REscore = \sum_{i=0}^{cols-1}\sum_{j \in \lbrace A,C,G,T \rbrace}P_{i}(j)\log_{2}\left(\frac{P_{i}(j)}{Q(j)}\right)$
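
As a minimal sketch of the two formulas above (not the required implementation), the snippet below estimates Q from the pooled sequences and scores a profile given as one dictionary of base probabilities per column. The function names `null_model` and `relative_entropy` are illustrative assumptions, and zero-probability terms are skipped under the usual convention that 0 log 0 = 0.

```python
import math

BASES = "ACGT"

def null_model(sequences):
    """Return Q: the background frequency of each base across all sequences."""
    pooled = "".join(sequences)
    return {base: pooled.count(base) / len(pooled) for base in BASES}

def relative_entropy(profile, null):
    """Sum P_i(j) * log2(P_i(j) / Q(j)) over every column i and base j.

    profile is a list with one {base: probability} dict per column.
    """
    score = 0.0
    for column in profile:
        for base in BASES:
            p = column[base]
            if p > 0:  # treat 0 * log2(0) as 0
                score += p * math.log2(p / null[base])
    return score

# A column identical to the background adds nothing; a fully conserved
# column against a uniform background adds log2(4) = 2 bits.
uniform = {base: 0.25 for base in BASES}
conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
print(relative_entropy([uniform, conserved], uniform))  # 2.0
```

A higher score means the profile departs further from what base composition alone would predict, which is the role the scrambled-sequence comparison used to play.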

## Assignment
__Write a BME205 style program (.py file)__ that is based on the Randomized Motif Search presented in class/videos and is described in Chapter 2 of the text. Your program should accept a fasta file that contains multiple fasta records as input (STDIN), and your program will then output (STDOUT) the consensus sequence and associated profile score. The score will be the sum of encoding costs (entropies) across each position in the final profile. We will use pseudocounts for this assignment, as described in the text (p. 86-89), and we will need an option to provide the number of pseudocounts (-p).

__Produce a notebook that should be fully runnable for use during inspections.__

I find it easier to accumulate counts rather than calculating a probability based profile. This will also make it a bit easier to deal with pseudocounts when scoring your count-based profile.

$score = \sum_{i=0}^{cols-1}\sum_{j \in \lbrace A,C,G,T \rbrace}P_{i}(j)\log_{2}\left(\frac{P_{i}(j)}{Q(j)}\right)$
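
One way to read the count-based suggestion above, sketched under the assumption that the pseudocount is added to every cell before normalizing: keep a table of counts per column and divide by the column total only when a probability is needed. `count_table` and `score_from_counts` are hypothetical names chosen for this example.

```python
import math

BASES = "ACGT"

def count_table(motifs, pseudo=1.0):
    """Count each base per column, starting every cell at the pseudocount."""
    k = len(motifs[0])
    counts = {base: [pseudo] * k for base in BASES}
    for motif in motifs:
        for i, base in enumerate(motif):
            counts[base][i] += 1
    return counts

def score_from_counts(counts, null):
    """Relative entropy computed directly from a pseudocounted count table."""
    k = len(counts["A"])
    score = 0.0
    for i in range(k):
        # column total = number of motifs + 4 * pseudocount
        total = sum(counts[base][i] for base in BASES)
        for base in BASES:
            p = counts[base][i] / total
            if p > 0:  # p can be 0 only when pseudo == 0
                score += p * math.log2(p / null[base])
    return score

counts = count_table(["AC", "AG"], pseudo=1)
print(counts["A"])  # [3, 1]
print(counts["C"])  # [1, 2]
```

Because the pseudocount is already in the table, no cell is zero when `pseudo` is positive, so the logarithm is always defined.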

Randomized Motif Search (p. 93) can be easily trapped in local minima, so we will need to iterate some number of times to find a trajectory that produces the best results. This will need to be an option (-i) that establishes the iteration number.
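
The restart loop can be sketched as follows. This is a compact, dependency-free illustration of the idea rather than the program the assignment asks for: `search_once` performs one randomized run that stops when the score no longer improves, and `best_of` keeps the best result over `iterations` runs. All function names, the pseudocounted null model, and the choice to maximize the relative entropy score are assumptions made for the example, and `pseudo` is assumed to be greater than zero.

```python
import math
import random

BASES = "ACGT"

def profile(motifs, pseudo):
    """Per-column base probabilities from pseudocounted counts."""
    k = len(motifs[0])
    total = len(motifs) + 4 * pseudo
    return [{b: (pseudo + sum(m[i] == b for m in motifs)) / total for b in BASES}
            for i in range(k)]

def score(prof, null):
    """Relative entropy of the profile against the null model (needs pseudo > 0)."""
    return sum(col[b] * math.log2(col[b] / null[b]) for col in prof for b in BASES)

def most_probable(seq, prof, k):
    """The k-mer in seq with the highest probability under the profile."""
    kmers = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return max(kmers, key=lambda kmer: math.prod(prof[i][b] for i, b in enumerate(kmer)))

def search_once(dna, k, pseudo, null, rng):
    """One randomized run: random start, then refine until no improvement."""
    starts = [rng.randrange(len(seq) - k + 1) for seq in dna]
    best = [seq[s:s + k] for seq, s in zip(dna, starts)]
    best_score = score(profile(best, pseudo), null)
    while True:
        prof = profile(best, pseudo)
        motifs = [most_probable(seq, prof, k) for seq in dna]
        new_score = score(profile(motifs, pseudo), null)
        if new_score <= best_score:
            return best, best_score
        best, best_score = motifs, new_score

def best_of(dna, k, pseudo=1, iterations=100, seed=None):
    """Repeat search_once and keep the highest-scoring motif set."""
    rng = random.Random(seed)
    pooled = "".join(dna)
    null = {b: (pseudo + pooled.count(b)) / (len(pooled) + 4 * pseudo) for b in BASES}
    runs = (search_once(dna, k, pseudo, null, rng) for _ in range(iterations))
    return max(runs, key=lambda run: run[1])

motifs, best = best_of(["TTACGTAA", "GGACGTCC", "CAACGTTG"], k=4, iterations=50, seed=0)
print(motifs, round(best, 3))
```

Each run is cheap, so raising `iterations` trades running time for a better chance that at least one random start lands near the true motif.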

We also don't know the appropriate motif length that we are looking for, so we need an option (-k) to specify this.

The full command line string should then look something like this:

In [None]:
randomizedMotifSearch.py -i=100000 -p=1 -k=13 <somefile.fa >someOutputFile.fa

Your program will need to use standard OO style, include docstrings, and be well commented. It must run!

I will provide a few fasta files that present the upstream 50 bases from species of the Pyrobaculum group. It is possible that the promoter+BRE that we are looking for is conserved across these species, though maybe not perfectly. Providing more data to a single run may be useful.

I do have expression data for these species, so we know that not all of these arrays are functional - at least under the conditions when I grew them.

Your program should output the final score achieved, along with the associated consensus motif.

## Extra Credit
The following optional features would be useful and considered for a combined extra 5 points:

1) -g Use Gibbs sampling to find the optimal consensus motif.

2) -m Print the specific motif and the name of the contributing sequence for each of the participating fasta records.

Each of these options should only be true when the flag is specified by the user (i.e., you shouldn't have to explicitly state True/False on the command line).
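
A flag that is true only when present maps onto argparse's `store_true` action. The sketch below is an assumption about how the two extra-credit options could be declared; the long option names `--gibbs` and `--motifs` are invented for the example.

```python
import argparse

parser = argparse.ArgumentParser(description="extra-credit flags (illustrative)")
# store_true: the attribute defaults to False and becomes True when the flag appears
parser.add_argument("-g", "--gibbs", action="store_true",
                    help="use Gibbs sampling")
parser.add_argument("-m", "--motifs", action="store_true",
                    help="print each sequence name with its motif")

args = parser.parse_args(["-g"])  # stands in for the real command line
print(args.gibbs, args.motifs)    # True False
```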

Submit your working program to Canvas.

## Notes:

1) It is likely that we will need to amend this assignment as we work with it.

2) Sept-2021 eliminated the need for scrambling sequences by using relative entropy

## Inspection intro

- What data structure did you use to hold your sequence motifs (DNA)? 
I used pandas dataframes.
- Is your profile structure based on counts or frequencies?
My profile structure is based on counts.
- Where did you implement iterations (mostly why did you choose this)?
I implemented iterations in a separate function, which calls the RandomizedMotifSearch function. I figured it was easier and cleaner to keep the RMS function as close to the text as possible and simply use something else to call it.
- If you implemented Gibbs Sampling, how did you implement recomputing of the profile and motif?

## Inspection results

Comparing my team members' implementations, I saw some similarities, though I think mine is the cleanest. Trevor and Konstantinos had fairly cluttered notebooks, while Dodi's was pretty concise. I ended up liking Trevor's method of calculating the null model counts, so I adopted it and made it more concise by building the null model at initialization instead of in a separate function.

In [1]:
import sys
import math

class FastAreader:
    def __init__(self, fname=''):
        '''constructor: saves attribute fname '''

        self.fname = fname
        self.fileH = None

    def doOpen(self):
        if self.fname == '':
            return sys.stdin
        else:
            return open(self.fname)

    def readFasta(self):

        header = ''
        sequence = ''

        with self.doOpen() as self.fileH:

            header = ''
            sequence = ''

            # skip to first fasta header
            line = self.fileH.readline()
            while not line.startswith('>'):
                line = self.fileH.readline()
            header = line[1:].rstrip()

            for line in self.fileH:
                if line.startswith('>'):
                    yield header, sequence
                    header = line[1:].rstrip()
                    sequence = ''
                else:
                    sequence += ''.join(line.rstrip().split()).upper()

        yield header, sequence


In [2]:
class CommandLine():
    '''
    Handle the command line, usage and help requests.

    CommandLine uses argparse, now standard in 2.7 and beyond.
    It implements a standard command line argument parser with various argument options,
    a standard usage and help, and an error termination mechanism do_usage_and_die.

    attributes:
    all arguments received from the commandline using .add_argument will be
    available within the .args attribute of object instantiated from CommandLine.
    For example, if myCommandLine is an object of the class, and requiredbool was
    set as an option using add_argument, then myCommandLine.args.requiredbool will
    name that option.

    '''

    def __init__(self, inOpts=None):
        '''
        CommandLine constructor.
        Implements a parser to interpret the command line argv string using argparse.
        '''

        import argparse
        self.parser = argparse.ArgumentParser(
            description='Program prolog - a brief description of what this thing does',
            epilog='Program epilog - some other stuff you feel compelled to say',
            add_help=True,  # default is True
            prefix_chars='-',
            usage='%(prog)s [options] -option1[default] <input >output'
        )

        self.parser.add_argument('-k', '--motif_length', type=int, default=15, action='store',
                                 help='kMer size')
        self.parser.add_argument('-p', '--pseudo_count', type=float, default=1, action='store',
                                 help='pseudo count value')
        self.parser.add_argument('-i', '--iterations', type=int, default=1000, action='store',
                                 help='times to run RMS')


        self.parser.add_argument('-v', '--version', action='version', version='%(prog)s 0.1')
        if inOpts is None:
            self.args = self.parser.parse_args()
        else:
            self.args = self.parser.parse_args(inOpts)

In [144]:
import numpy as np
import pandas as pd

class Motif():
    
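    def __init__(self, DNA, k, pseudo=1, iterations=1000):
        '''Store the sequences and search parameters, then build the null model.'''
        # initialize variables
        self.bases = ['A', 'C', 'G', 'T']
        self.DNA = DNA
        self.k = k
        self.t = len(DNA)  # number of input sequences
        self.pseudo = pseudo
        self.iterations = iterations  # number of random restarts
        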
        
        # initialize null model using concatenated DNA sequences
        self.string =''.join(self.DNA)
        self.null = {base: (self.pseudo + self.string.count(base))/(len(self.string)+(self.pseudo*4)) for base in self.bases}
    
    def rand_motifs(self):
        # use RNG to create a bunch of random motifs
        motifs = []
        for seq in self.DNA:
            start_i = np.random.randint(0, len(seq)-self.k+1)
            end_i = start_i + self.k
            
            motifs.append(seq[start_i:end_i])  
        return motifs


    def Score(self, profile):
        # Find relative entropy score using profile table
        score = 0
        for x in range(self.k):
            for base in self.bases: 
                prob_x = profile.loc[base, x]
                score += prob_x * np.log2( prob_x/self.null[base] )
        return score
 
    
    def Count(self, motifs):
        # construct count table that's already filled with pseudo-count
        count_table = pd.DataFrame(    np.full((4,self.k), self.pseudo)   , index=self.bases    )
        
        # iterate over every motif and return a table with counts of bases
        for motif in motifs:
            for index, base in enumerate(motif): #bases and index correspond to rows and columns
                count_table.loc[base, index] += 1
        
        return count_table

    def Profile(self, motifs):
        # use count table to calculate profile probabilities: each count is
        # divided by its column total (pseudocounts included)
        count_table = self.Count(motifs)
        return count_table / count_table.sum(axis=0)

    
    def Consensus(self, profile):
        # iterates over a profile and returns the consensus string according to
        # the max value and corresponding row (base) of each column
        consensus_string = ''
        for x in range(self.k):
            consensus_string += profile[x].idxmax()
            
        return consensus_string
    
    def motifs_from_profile(self, DNA, profile):
        # uses profile table to find the kmer with the best cumulative probability
        # in each sequence
        motifs = []
        for seq in DNA:
            best_prob = -1
            for x in range(len(seq)-self.k+1):
                kmer = seq[x:x+self.k]
                prob = 1
                for index, base in enumerate(kmer):
                    prob *= profile.loc[base, index]
                if prob > best_prob:
                    best_prob = prob
                    best_motif = kmer
            motifs.append(best_motif)
        return motifs
     
    def RandomizedMotifSearch(self):
        # greedy local search started from randomly selected kmers;
        # stops as soon as the score fails to improve
        best_motifs = self.rand_motifs()
        best_score = self.Score( self.Profile(best_motifs)  )

        while True:
            profile = self.Profile(best_motifs)
            motifs = self.motifs_from_profile(self.DNA, profile)
            
            score = self.Score( self.Profile(motifs) )
            if score > best_score:
                best_motifs = motifs
                best_score = score

            else:
                return [best_motifs, best_score]
    
    def complete_search(self):
        # repeat randomized motif search a number of times and return the
        # consensus string with the best score
        final_list = []
        for x in range(self.iterations):
            motifs_candidate = self.RandomizedMotifSearch()
            final_list.append(motifs_candidate)
        final_motifs = sorted(final_list, key=lambda x: x[1], reverse=True)
        motifs, score = final_motifs[0]
        profile = self.Profile(motifs)
        return self.Consensus(profile), score
            

In [145]:

k = 8
t = 5
DNA = [
    'CGCCCCTCTCGGGGGTGTTCAGTAAACGGCCA',
    'GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG',
    'TAGTACCGAGACCGAAAGAAGTATACAGGCGT',
    'TAGATCAAGTTTCAGGTGCACGTCGGTGAACC',
    'AATCCACCAGCTCCACGTGCAATGTTGGCCTA'
]

cheese = Motif(DNA, k)

THE_MOTIF = cheese.complete_search()
print(THE_MOTIF)

('GTGTACAG', 2.709372622306024)
