# Finding CRISPR arrays
## Specification
## Program Name: randomizedMotifSearch.py

## Input/Output: STDIN, STDOUT

## Options:

- -i iterations (int)
- -k motif length (int)
- -p pseudocount (float)

In [None]:
Ex: python randomizedMotifSearch.py -i 1000 -k 13 -p 1 < input.fa > output.out

Inspection Notebook example:
main ( "input.fa", options = [ "-i 1000", "-k 13", "-p 1"] )

## Background
CRISPR arrays exist in most archaeal species and many bacterial species. The arrays are encoded in chromosomal DNA as direct repeats (DRs) that flank sequences (spacers), which provide immunity against invading plasmids, viral sequences, and other foreign DNA and RNA. These arrays can be short, comprising 1-2 spacers, or they can contain hundreds of spacer elements, each punctuated by a DR sequence. The promoter element seems to be upstream of the array; it initiates transcription of the array into an RNA strand that is then cleaved, and the resulting spacers are loaded into protein machines that specifically target the associated invader should it be found. In known cases, that targeting causes cleavage of the foreign DNA.

There are three main classes of CRISPR systems: types I, II, and III. They are defined by the associated protein machines or, more precisely, by the specific genes that encode them. In the archaea, only types I and III have been found so far, while in bacteria all three types have been seen. Type II exists in a few species, including _Streptococcus pyogenes_. The specific gene used to classify this system is named Cas9.

We see cases where Cas genes, and sometimes CRISPR arrays, exist in a viral sequence. This, of course, opens many questions about the role of the CRISPR system with respect to viruses - why would a virus carry an immune defense that targets viral sequence? The existence of these arrays suggests that the CRISPR system itself might be mobile. How does the system get established, and how does the surrounding transcription machinery get encoded on the chromosome?

For this assignment, we will work to find the promoter motif. We would expect that this promoter motif and an associated B recognition element (BRE) would be present in archaeal species. What is that motif, and where is it located relative to transcription start?

We know little about the sequence specifics of the promoter element, though it seems to be about 13 nucleotides in length. In Pyrobaculum species, genes are packed quite tightly, so we can look upstream of the array by about 50 bases to see if a common motif sequence might exist. We need to assume that some mutations can be tolerated in that sequence.

## Null model
In prior years, we included an option that scrambled the input sequences before running to find a best score. This would effectively eliminate any signal that might be present in the data, while preserving the overall composition of the sequences. We would then compare the best score achieved without scrambling to the score achieved with scrambling. The only information that remains in a scrambled set of sequences is the base-level composition of the sequences. We now use relative entropy to compare our experimental model (P) to our null model (Q):

$Q=Pr_{s\in\lbrace{ACTG}\rbrace}(s)=\frac{count(s)}{N}$
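
As a minimal sketch of the null model above, Q can be estimated by pooling all input sequences and dividing each base count by the total. The function name `null_model` and the choice to ignore non-ACGT symbols are assumptions made for illustration, not part of the assignment.

```python
from collections import Counter


def null_model(sequences, alphabet="ACGT"):
    """Return Q(s) = count(s) / N pooled over all input sequences."""
    counts = Counter("".join(sequences))
    # Assumption: symbols outside the alphabet (e.g. N) are ignored.
    total = sum(counts[base] for base in alphabet)
    return {base: counts[base] / total for base in alphabet}


q = null_model(["ACGT", "AAGT"])
for base in "ACGT":
    print(base, q[base])
```

The result is a dictionary keyed by base, which can be passed directly to a scoring function.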

We can then use relative entropy to find a score relative to the base level composition by:

$REscore = \sum_{i=0}^{cols}\sum_{j \in \lbrace{ACGT}\rbrace}P_{i}(j)\log_{2}\left(\frac{P_{i}(j)}{Q(j)}\right)$

## Assignment
__Write a BME205 style program (.py file)__ that is based on the Randomized Motif Search presented in class/videos and is described in Chapter 2 of the text. Your program should accept a fasta file that contains multiple fasta records as input (STDIN), and your program will then output (STDOUT) the consensus sequence and associated profile score. The score will be the sum of encoding costs (entropies) across each position in the final profile. We will use pseudocounts for this assignment, as described in the text (p. 86-89), and we will need an option to provide the number of pseudocounts (-p).

__Produce a notebook that should be fully runnable for use during inspections.__

I find it easier to accumulate counts rather than calculating a probability based profile. This will also make it a bit easier to deal with pseudocounts when scoring your count-based profile.

$score = \sum_{i=0}^{cols}\sum_{j \in \lbrace{ACGT}\rbrace}P_{i}(j)\log_{2}\left(\frac{P_{i}(j)}{Q(j)}\right)$
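
The count-based approach might look something like the sketch below: counts are accumulated per column, each starting at the pseudocount, and are only converted to probabilities at scoring time. The names `count_profile` and `relative_entropy`, the toy motifs, and the uniform null model are all hypothetical and chosen for illustration.

```python
import math


def count_profile(motifs, pseudocount=1.0, alphabet="ACGT"):
    """Column-wise base counts, each entry starting at the pseudocount."""
    k = len(motifs[0])
    profile = {base: [pseudocount] * k for base in alphabet}
    for motif in motifs:
        for pos, base in enumerate(motif):
            profile[base][pos] += 1
    return profile


def relative_entropy(profile, q):
    """Sum over columns and bases of P * log2(P / Q)."""
    k = len(next(iter(profile.values())))
    score = 0.0
    for pos in range(k):
        # The column total already includes the pseudocounts.
        column_total = sum(profile[base][pos] for base in profile)
        for base in profile:
            p = profile[base][pos] / column_total
            if p > 0:  # a zero probability contributes nothing to the sum
                score += p * math.log2(p / q[base])
    return score


motifs = ["ACG", "ACG", "ACT", "ACG"]            # toy data, not from the assignment
uniform_q = {base: 0.25 for base in "ACGT"}      # assumed null model
profile = count_profile(motifs, pseudocount=1.0)
score = relative_entropy(profile, uniform_q)
print(profile["A"], score)
```

Because the pseudocount is added once per base in every column, the denominator for each column is the number of motifs plus four times the pseudocount.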

Randomized Motif Search (p. 93) can be easily trapped in local minima, so we will need to iterate some number of times to find a trajectory that produces the best results. This will need to be an option (-i) that establishes the iteration number.

We also don't know the appropriate motif length that we are looking for, so we need an option (-k) to specify this.

The full command line string should then look something like this:

In [None]:
randomizedMotifSearch.py -i=100000 -p=1 -k=13 <somefile.fa >someOutputFile.fa

Your program will need to use standard OO style, include docstrings, and be well commented. It must run!

I will provide a few fasta files that present the upstream 50 bases from species of the Pyrobaculum group. It is possible that the promoter+BRE that we are looking for is conserved across these species, though perhaps not perfectly. Providing more data to a single run may be useful.

I do have expression data for these species, so we know that not all of these arrays are functional - at least under the conditions when I grew them.

Your program should output the final score achieved, along with the associated consensus motif.

## Extra Credit
The following optional features would be useful and considered for a combined extra 5 points:

1) -g Use Gibbs sampling to find the optimal consensus motif.

2) -m Print the specific motif and the name of the contributing sequence for each of the participating fasta records.

Each of these options should only be true when the flag is specified by the user (i.e., you shouldn't have to explicitly state True/False on the command line).
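
One way to get this flag behavior with `argparse` is the `store_true` action, which leaves the option `False` unless the flag appears. The sketch below is illustrative only; the default values for `-i`, `-k`, and `-p` are assumptions, not requirements of the assignment.

```python
import argparse

parser = argparse.ArgumentParser(description="Randomized motif search (option sketch)")
parser.add_argument("-i", type=int, default=1000, help="iterations (default is an assumption)")
parser.add_argument("-k", type=int, default=13, help="motif length (default is an assumption)")
parser.add_argument("-p", type=float, default=1.0, help="pseudocount (default is an assumption)")
# store_true: the attribute is False unless the flag is given on the command line
parser.add_argument("-g", action="store_true", help="use Gibbs sampling")
parser.add_argument("-m", action="store_true", help="print the motif chosen from each record")

# Parse an explicit argument list so the sketch also runs inside a notebook.
args = parser.parse_args(["-i", "100", "-g"])
print(args.i, args.g, args.m)
```

Here `-g` was supplied and `-m` was not, so only `args.g` is set.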

Submit your working program to Canvas.

## Notes:

1) It is likely that we will need to amend this assignment as we work with it.

2) Sept-2021 eliminated the need for scrambling sequences by using relative entropy

## Inspection intro

- What data structure did you use to hold your sequence motifs (DNA)?
 > I used lists to hold the sequence motifs.

    
- Is your profile  structure based on counts or frequencies? 
 > My profile structure is based on counts.

- Where did you implement iterations (mostly why did you choose this)?
> The iterations are implemented as the outer `for` loop in `randomMotifSearch`. Each iteration starts from a fresh set of randomly selected k-mers and refines them in an inner loop until the relative entropy score stops improving. I chose this because a single run of Randomized Motif Search can get trapped in a local optimum, so repeating the search from many random starting points and keeping the best-scoring motif set explores more of the search space.

- If you implemented Gibbs Sampling, how did you implement recomputing of the profile and motif?

## Inspection Results

I was stuck with relative entropy and the use of pseudocounts. Kerney, Paul, and I discussed and understood the concept of relative entropy, as well as how to calculate the null distribution, and how to incorporate pseudocounts into the calculations. They were instrumental in helping me correct my code by adding pseudocounts to the existing base counts in the sequence.

In [5]:
import sys
import math
import random
from collections import Counter


class FastAreader:
    def __init__(self, fname=''):
        '''constructor: saves attribute fname '''

        self.fname = fname
        self.fileH = None

    def doOpen(self):
        if self.fname == '':
            return sys.stdin
        else:
            return open(self.fname)

    def readFasta(self):

        header = ''
        sequence = ''

        with self.doOpen() as self.fileH:

            header = ''
            sequence = ''

            # skip to first fasta header
            line = self.fileH.readline()
            while not line.startswith('>'):
                line = self.fileH.readline()
            header = line[1:].rstrip()

            for line in self.fileH:
                if line.startswith('>'):
                    yield header, sequence
                    header = line[1:].rstrip()
                    sequence = ''
                else:
                    sequence += ''.join(line.rstrip().split()).upper()
        yield header, sequence

class CommandLine():
    """
    Handle the command line, usage and help requests.

    CommandLine uses argparse, now standard in 2.7 and beyond.
    it implements a standard command line argument parser with various argument options,
    a standard usage and help, and an error termination mechanism do-usage_and_die.

    attributes:
    all arguments received from the commandline using .add_argument will be
    available within the .args attribute of object instantiated from CommandLine.
    For example, if myCommandLine is an object of the class, and requiredbool was
    set as an option using add_argument, then myCommandLine.args.requiredbool will
    name that option.

    """

    def __init__(self, inOpts=None):
        """
        CommandLine constructor.
        There are three arguments that are passed to the class Consensus():
            -i iterations (int)
            -k motif length (int)
            -p pseudocount (float)
        """
        import argparse
        self.parser = argparse.ArgumentParser(
            description='Program prolog - a brief description of what this thing does',
            epilog='Program epilog - some other stuff you feel compelled to say',
            add_help=True,  # default is True
            prefix_chars='-',
            usage='python randomizedMotifSearch.py -i int -k int -p float < input.fa > output.out'
        )
        # Be sure to go over the argument information again
        self.parser.add_argument('-i', type=int, action='store',
                                 help='number of iterations (int)')
        self.parser.add_argument('-k', type=int, action='store',
                                 help='motif length (int)')
        self.parser.add_argument('-p', type=float, action='store',
                                 help='pseudocount (float)')

        if inOpts is None:
            self.args = self.parser.parse_args()
        else:
            self.args = self.parser.parse_args(inOpts)

class Genome:
    """
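    Hold the search parameters and input sequences, and run Randomized Motif Search on them.

    Despite the name, an instance stores the set of upstream sequences read from the fasta
    input (not a whole genome), along with the best score and motif set found so far.
    """
    def __init__(self, i, k, p, input_sequences):
        """
        Store the options and precompute the null model.

        i: number of iterations (random restarts)
        k: motif length
        p: pseudocount added to every base count
        input_sequences: list of DNA sequences from the fasta file
        """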
        self.iterations = i
        self.pseudo_counts = p
        self.kmer_size = k
        self.input_sequences = input_sequences
        self.best_score = 0
        self.best_motif = []
        self.null_distribution = self.null_model_distribution()
        
    def newMotif(self, profile):
        """
        Function: To get new set of motifs from the existing profiles.
        input: profile of a set of existing selected motifs. 
        
        return: new set of motifs that generated using the counts of bases from the profile. 
        """


        new_motif = []
        for seq in self.input_sequences:
            best_prod_score = -1  # below any possible product, so a k-mer is always selected
            best_motif = ''

            for i in range(0, len(seq)-self.kmer_size+1):
                motif = seq[i:i+self.kmer_size]
                product_score = 1
                for pos, base in enumerate(motif):
                    count_from_profile = profile[base][pos]
                    product_score = product_score * count_from_profile
                if product_score > best_prod_score:
                    best_prod_score = product_score
                    best_motif = motif
                
            new_motif.append(best_motif)
        
        return new_motif

    def selectRandomKmer(self, input_sequence_list):
        """
        Function: select random kmers from all the records in the fasta file. 
        input: input sequence list from the fasta file.
        
        return: set of random motifs of size k.
        """
        
        rand_kmer_array = []
        
        for sequence in input_sequence_list:
            
            rand_index = random.randint(0, len(sequence)-self.kmer_size)
            
            random_kmer = sequence[rand_index:rand_index+self.kmer_size]
            rand_kmer_array.append(random_kmer)
        return rand_kmer_array
    
    def null_model_distribution(self):
        """
        Function that returns a dictionary of the background frequency (with pseudocounts) of each
        base {A,C,G,T} across all of the input sequences.
        """
        joined_input_seq = "".join(self.input_sequences)
        base_counter = Counter(joined_input_seq)
        # count only A, C, G and T so that ambiguity codes such as N never enter the null model
        total = sum(base_counter[base] for base in 'ACGT') + self.pseudo_counts * 4

        null_distribution = {base: (base_counter[base] + self.pseudo_counts) / total for base in 'ACGT'}

        return null_distribution


    def make_profile_from_kmers(self, input_motifs):
        """
        The function calculates the distribution of bases for the provided input set of motifs. 

        input: set of motifs of kmer-size k. 
        return: a matrix of size 4xk, where the four rows are the bases {A,C,G,T} and the k columns hold the counts of each base (plus pseudocounts).
        """
        profile_dict = {
            'A': [],
            'G': [],
            'C': [],
            'T': []
        }

        for i in range(self.kmer_size):
            # count the bases found in column i across all motifs
            col_counter = Counter(motif[i] for motif in input_motifs)

            # a Counter returns 0 for a missing base, so every entry is count + pseudocount
            for base in profile_dict:
                profile_dict[base].append(col_counter[base] + self.pseudo_counts)
        return profile_dict
    
    def calcRelativeEntropy(self, input_motifs):
        """
        The function considers the null model and compares it with the experimental model in order to calculate the relative entropy.
        The function also call the make_profile_from_kmers function to create profile from the provided input set of motifs. 

        Input: the set of motifs of kmer-size k. 

        return: the relative entropy score of the provided set of motifs and the profile of the motifs. 

        """
        relative_motif_score = 0

        profile = self.make_profile_from_kmers(input_motifs)
        for col in range(self.kmer_size):
            for base in self.null_distribution:
                if profile[base][col] != 0: 
                    pr = profile[base][col]/(len(self.input_sequences) + (4*self.pseudo_counts))
                    relative_motif_score += pr * math.log2(pr / self.null_distribution[base])
        return relative_motif_score, profile

    def randomMotifSearch(self, input_seqs):
        """
        The function runs self.iterations random restarts. Each restart selects random motifs of kmer-size k, then repeatedly builds a profile
        and selects the next set of motifs, which is compared with the previous set using the relative entropy score. A restart ends when the score stops improving.

        Input: input sequences of the whole fasta file. 
        return: None; prints the consensus and the best relative entropy score of the given input. 
        """
        
        for i in range(self.iterations):
            
            random_kmers = self.selectRandomKmer(input_seqs) # initialized kmers randomly with kmer-size -k
            best_score_motif, best_profile = self.calcRelativeEntropy(random_kmers) # calculate the score of the randomly initialized kmers. Consider it as the best_score. 
            selected_motifs = random_kmers
            while True:
                new_motifs = self.newMotif(best_profile) # select new motifs from the current profile
                new_score, new_profile = self.calcRelativeEntropy(new_motifs)
                
                if new_score > best_score_motif:
                    selected_motifs = new_motifs
                    best_score_motif = new_score
                    best_profile = new_profile
                else:
                    break

            # keep the best local optimum across restarts, even if a restart never improved on its random start
            if best_score_motif > self.best_score:
                self.best_score = best_score_motif
                self.best_motif = selected_motifs

        print("Found Consencus {} with score - {}".format(self.printConsencus(self.best_motif), self.best_score))

    def printConsencus(self, input_motifs):
        """
        Function: Using the best motif set found by the randomized search algorithm, the function returns the consensus from the profile of those motifs. 

        Input: input_motifs - the best motif set found by the randomized search algorithm.

        return: the consensus motif of kmer-size k. 
        """
        motif_set = input_motifs
        profile = self.make_profile_from_kmers(motif_set)
        concensus = ''
        for i in range(self.kmer_size):
            max_prob = 0
            max_base = 'A'
            for base, probs in profile.items():
                if probs[i] > max_prob:
                    max_prob = probs[i]
                    max_base = base
            concensus+= max_base
        return concensus

def main(inFile="", options = None):
    ''' Setup necessary objects, read data and print the final report.'''
    cl = CommandLine(options) # setup the command line
    sourceReader = FastAreader(inFile) # setup the Fasta reader Object
    sequenceList = []
    for head, seq in sourceReader.readFasta():       # reading the fasta file. 
        sequenceList.append(seq)
    
    thisGenome = Genome(cl.args.i, cl.args.k, cl.args.p, sequenceList)
    thisGenome.randomMotifSearch(sequenceList)

if __name__ == "__main__":
    # main()
    inpFile = "p1860Crisprs"
    main(inpFile, options = [ "-i 1000", "-k 13", "-p 1"])

Found Consencus GAAAAACTTAAAA with score - 8.217968381511612
