<a href="https://colab.research.google.com/github/nitrozyna/Rosalind/blob/master/10_cons_numpy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Problem description:
[Consensus and Profile](http://rosalind.info/problems/cons/)

A **matrix** is a rectangular table of values divided into rows and columns. An m×n matrix has m rows and n columns. Given a matrix A, we write $A_{i,j}$ to indicate the value found at the intersection of row i and column j.

Say that we have a collection of **DNA strings**, all having the same length n. Their **profile matrix** is a 4×n **matrix** P in which $P_{1,j}$ represents the number of times that 'A' occurs in the jth **position** of one of the strings, $P_{2,j}$ represents the number of times that 'C' occurs in the jth position, and so on (see below).

A **consensus string** c is a string of length n formed from our collection by taking the most common symbol at each position; the jth symbol of c therefore corresponds to the symbol having the maximum value in the jth column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.

                            A T C C A G C T
                            G G G C A A C T
                            A T G G A T C T
            DNA Strings     A A G C A A C C
                            T T G G A A C T
                            A T G C C A T T
                            A T G G C A C T

                        A   5 1 0 0 5 5 0 0
            Profile     C   0 0 1 4 2 0 6 1
                        G   1 1 6 3 0 1 0 0
                        T   1 5 0 0 0 1 1 6

            Consensus       A T G C A A C T
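
The definitions above can be checked directly on the seven example strings. The following is a minimal sketch, not the notebook's own method: the helper name `profile_and_consensus` and the constant `BASES` are made up for illustration, and ties are resolved in favour of the earlier base in `ACGT` order, which is one of the permitted choices.

```python
import numpy as np

BASES = "ACGT"  # row order of the profile matrix

def profile_and_consensus(strings):
    """Return (4 x n profile matrix, consensus string) for equal-length DNA strings."""
    # One row per string, one column per position; each cell is a single character.
    chars = np.array([list(s) for s in strings])
    # Row k counts how often BASES[k] occurs in each column.
    profile = np.array([(chars == base).sum(axis=0) for base in BASES])
    # argmax returns the first row that holds the column maximum,
    # so a tie goes to the earlier base in "ACGT" order.
    consensus = "".join(BASES[k] for k in profile.argmax(axis=0))
    return profile, consensus

strings = ["ATCCAGCT", "GGGCAACT", "ATGGATCT", "AAGCAACC",
           "TTGGAACT", "ATGCCATT", "ATGGCACT"]
profile, consensus = profile_and_consensus(strings)
print(consensus)
for base, row in zip(BASES, profile):
    print(f"{base}:", " ".join(map(str, row)))
```

For these strings the sketch prints the consensus `ATGCAACT` followed by four rows that match the profile shown above.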

---

### Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

### Return:  A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)

Sample Dataset
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT

Sample Output

ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6
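
The sample dataset is in FASTA format: each record starts with a `>` label line, and the sequence may continue over several lines. As a rough sketch of how such input could be read, the hypothetical helper `parse_fasta` below collects `(label, sequence)` pairs from any file-like object; it is an illustration under those assumptions, not the parser used later in this notebook.

```python
import io

def parse_fasta(handle):
    """Return a list of (label, sequence) pairs from a FASTA-formatted file object."""
    records = []
    label, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if label is not None:
                records.append((label, "".join(chunks)))
            label, chunks = line[1:], []
        else:
            chunks.append(line)  # a sequence may span several lines
    if label is not None:
        records.append((label, "".join(chunks)))
    return records

sample = """>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
"""
records = parse_fasta(io.StringIO(sample))
for label, seq in records:
    print(label, seq)
```

Because records are only stored once a label has been seen, the result contains no empty placeholder entry in front of the first read.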



In [0]:
#@title Importing some modules to make a connection between Colab and Drive to download the current dataset
!pip install PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)


In [0]:

#@title Loading test dataset
fileID = "13Y9MgQNC2WhbFqoCOgkiIQDfOiI2I_mp" #@param {type:"string"}
downloaded = drive.CreateFile({'id':fileID})
downloaded.GetContentFile('rosalind_cons.txt')  # replace the file name with your file

In [0]:
#@title Reading the file to get all reads
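with open('rosalind_cons.txt','r') as f:
    dna = ""
    all_reads = []
    for read in f:
        read = read.strip()
        if read.startswith(">"):
            # A header line closes the previous record. For the first header
            # this appends an empty string, so all_reads[0] is a placeholder
            # and the real reads are all_reads[1:].
            all_reads.append(dna)
            dna = ""
        else:
            # Sequence lines of one record are concatenated.
            dna += read
    all_reads.append(dna)  # the last record has no header after it
read_len = len(all_reads[1])  # all reads have the same length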

In [0]:
#@title Function for creating a profile matrix
import numpy

def getMatrix(base,reads,matrix):
    # Count how often `base` occurs at each position of the reads.
    # `matrix` is a 1 x read_len array that is updated in place and returned.
    for read in reads:
        for idx,b in enumerate(read):
            if b == base:
                matrix[0][idx] += 1
    return matrix

In [0]:
#@title Getting the profile matrices for each base
# all_reads[0] is the empty placeholder left by the parser, so it is skipped
a = getMatrix('A',all_reads[1:],numpy.zeros(shape=(1,read_len)))
c = getMatrix('C',all_reads[1:],numpy.zeros(shape=(1,read_len)))
g = getMatrix('G',all_reads[1:],numpy.zeros(shape=(1,read_len)))
t = getMatrix('T',all_reads[1:],numpy.zeros(shape=(1,read_len)))

In [0]:
#@title Getting the consensus string
# Column-wise maximum count over the four bases (shape 1 x read_len)
cs_max = numpy.maximum.reduce([a,c,g,t])
cons = ""
for idx,b in enumerate(cs_max[0]):
    if (b == a[0][idx]):
        cons += 'A'
    elif (b == c[0][idx]):
        cons += 'C'
    elif (b == g[0][idx]):
        cons += 'G'
    elif (b == t[0][idx]):
        cons += 'T'
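
The `if`/`elif` chain above settles ties in the order A, C, G, T, which is allowed because any consensus string is accepted. The sketch below illustrates that behaviour on a small made-up profile (the numbers are hypothetical, not taken from the dataset) and also lists every base tied for the maximum in each column; the function names are illustrative only.

```python
import numpy as np

# Hypothetical 4 x 3 profile (rows A, C, G, T).
# Column 0 is a tie between A and G, column 2 a tie between C and T.
tie_profile = np.array([[2, 0, 1],
                        [0, 3, 2],
                        [2, 0, 0],
                        [0, 1, 2]])

def first_max_consensus(profile, bases="ACGT"):
    # argmax reports the first row reaching the column maximum,
    # which mirrors an if/elif chain tested in A, C, G, T order.
    return "".join(bases[k] for k in profile.argmax(axis=0))

def all_consensus_choices(profile, bases="ACGT"):
    # For each column, collect every base whose count equals the column maximum.
    col_max = profile.max(axis=0)
    return ["".join(b for b, row in zip(bases, profile) if row[j] == col_max[j])
            for j in range(profile.shape[1])]

print(first_max_consensus(tie_profile))    # ACC
print(all_consensus_choices(tie_profile))  # ['AG', 'C', 'CT']
```

Any string built by picking one letter from each entry of the second result is an equally valid consensus for that profile.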

In [85]:
#@title Pretty printing the matrices and the consensus string
print(cons)
print('A:', ' '.join(map(str, map(int, a[0]))))
print('C:', ' '.join(map(str, map(int, c[0]))))
print('G:', ' '.join(map(str, map(int, g[0]))))
print('T:', ' '.join(map(str, map(int, t[0]))))

CTTTGAATAACAGGTAAAACCACCAGACCTCTTCTTGAGAAACGCTTCGATCACCAAACATCCAACCGGGGATTGAGATTAAGAGCCTGTCAAACACGCAAATCACCTACAACTAGAGCCAACCCCTGGGCCCTTTCCCCAAGAGGTACTCAGAACAGCAACGATGGTACTCCATAGTACAAACTCGACGCAGACGAGAATAACAAAGTGTTTGAGAATCTATTCTCGTCAAAACACACATTGTCCACCCAAGCCTGAACAAAATAATTCATTTAGGCAACTACGTGCAGGACCGTGAGAAACCGATGACCCGGTATATCTGGAGCTACAACTGGGGTAACGAAACCTCATTGCGTAAATCCCTTAGCACAAAAAACCGTCTTCTAGGTACGCAGACCAAACAGTGCGCACTAATTTCGACACCCGTGGGGTGAAACGCTGATTAGCATATTAAAACTTGAACCTTAGACAATTACGATATCCGTATTTCTCGACCGGTGCCGAGGGAACTAAAAACTCCTGGGGCAGCTAGACAGAAAAAAGAGAGCGATAGAAAAACAACAGCGGACCCTCGATATAAAACAGGGTCAGAACGGTGACCTGCCCAAACCCAGAAAGTAGGGACCACAGTAGAACACCTCGAGTGAAATTGACAACAGGCCTCCTAGATAGAATGACTGATGTCAACCGCAAAGATGCCCTGAATTTCTAGCTACACGGCGCGTCTTTTGGATGCTTTCGCAACAAGTTAAGCGACCTAAAACTCATCATCAGACATTTACGGGGGCCTACCGCACACGCTCGCCATGCATAACTAACCAGCCCTCCTAGTAGAGAAACCCACGATGCTAAGACAGAGGTCGCCAAGCCGTAAACTACGTAGACTGGACAGGAAACTCCTGGTGTTGGTCAACGTACACCCCACAAGTGAGTAGAACACATGAAGCACCTATC
A: 0 1 1 2 2 6 4 1 4 3 2 5 2 1 3 4 4 4 4 2 1 