# Consensus and Profile
### Problem
A matrix is a rectangular table of values divided into rows and columns. An m×n matrix has m rows and n columns. Given a matrix A, we write $A_{i,j}$ to indicate the value found at the intersection of row i and column j.

Say that we have a collection of DNA strings, all having the same length n. Their profile matrix is a 4×n matrix P in which $P_{1,j}$ is the number of strings that have 'A' in the jth position, $P_{2,j}$ is the number of strings that have 'C' in the jth position, and so on for 'G' and 'T' (see below).

A consensus string c is a string of length n formed from our collection by taking the most common symbol at each position; the jth symbol of c therefore corresponds to the symbol having the maximum value in the jth column of the profile matrix. There may be more than one most common symbol at a position, in which case several consensus strings are possible.
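
To illustrate how ties multiply the number of consensus strings, here is a minimal sketch that lists every possibility for a small collection. The helper name `all_consensus_strings` and the two-string input are hypothetical, chosen only for illustration, and the strings are assumed to have equal length.

```python
from itertools import product

def all_consensus_strings(strings):
    """Return every consensus string of equal-length DNA strings."""
    per_column = []
    for column in zip(*strings):
        # Count each symbol in this column
        counts = {symbol: column.count(symbol) for symbol in "ACGT"}
        best = max(counts.values())
        # Keep every symbol that attains the maximum count
        per_column.append([s for s in "ACGT" if counts[s] == best])
    # Picking one tied symbol per column yields one consensus string
    return ["".join(choice) for choice in product(*per_column)]

# 'A' and 'C' tie in both columns, so there are 2 * 2 = 4 consensus strings
print(all_consensus_strings(["AC", "CA"]))  # ['AA', 'AC', 'CA', 'CC']
```

The number of consensus strings is the product of the number of tied symbols in each column, which is why the problem accepts any one of them.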

| | |
|---|---|
| **DNA strings** | A T C C A G C T |
| | G G G C A A C T |
| | A T G G A T C T |
| | A A G C A A C C |
| | T T G G A A C T |
| | A T G C C A T T |
| | A T G G C A C T |
| **Profile** | A: 5 1 0 0 5 5 0 0 |
| | C: 0 0 1 4 2 0 6 1 |
| | G: 1 1 6 3 0 1 0 0 |
| | T: 1 5 0 0 0 1 1 6 |
| **Consensus** | A T G C A A C T |
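
The worked example above can be reproduced with a short sketch. The function name `profile_and_consensus` and the dictionary layout of the profile are assumptions made for illustration; the input is assumed to be a non-empty list of equal-length strings over A, C, G and T.

```python
def profile_and_consensus(strings):
    """Return (profile, consensus) for equal-length DNA strings.

    profile maps each symbol to its list of per-position counts.
    """
    symbols = "ACGT"
    profile = {s: [] for s in symbols}
    # zip(*strings) walks the collection column by column
    for column in zip(*strings):
        for s in symbols:
            profile[s].append(column.count(s))
    # max() returns the first maximal item, so ties resolve in A, C, G, T order
    consensus = "".join(
        max(symbols, key=lambda s: profile[s][j])
        for j in range(len(strings[0]))
    )
    return profile, consensus

dna = ["ATCCAGCT", "GGGCAACT", "ATGGATCT", "AAGCAACC",
       "TTGGAACT", "ATGCCATT", "ATGGCACT"]
profile, consensus = profile_and_consensus(dna)
print(consensus)  # ATGCAACT
for s in "ACGT":
    print(s, profile[s])
```

Each column of the profile sums to 7, the number of strings, which is a quick sanity check on the counts.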

##### Given: 
A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

##### Return: 
A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)
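
Because FASTA files often wrap one sequence over several lines, a reader for this input probably needs to join the lines that follow each `>` header. The sketch below is one way to do that; `read_fasta` is a hypothetical helper name, and the in-memory text stands in for a real file.

```python
import io

def read_fasta(handle):
    """Return the sequences of a FASTA stream, in file order."""
    sequences = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            sequences.append([])  # header line: start a new record
        elif sequences:
            sequences[-1].append(line)  # sequence line: extend the current record
    return ["".join(parts) for parts in sequences]

# The first record is deliberately wrapped over two lines
example = io.StringIO(">Rosalind_1\nATCC\nAGCT\n>Rosalind_2\nGGGCAACT\n")
print(read_fasta(example))  # ['ATCCAGCT', 'GGGCAACT']
```

The same function accepts an open file object, since it only iterates over lines.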

### Sample Dataset
(Ignore the leading # on each header line; it is there only for formatting.)
#>Rosalind_1

ATCCAGCT

#>Rosalind_2

GGGCAACT

#>Rosalind_3

ATGGATCT

#>Rosalind_4

AAGCAACC

#>Rosalind_5

TTGGAACT

#>Rosalind_6

ATGCCATT

#>Rosalind_7

ATGGCACT

### Example Output
ATGCAACT

A: 5 1 0 0 5 5 0 0

C: 0 0 1 4 2 0 6 1

G: 1 1 6 3 0 1 0 0

T: 1 5 0 0 0 1 1 6
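
The output layout above (consensus string first, then one `symbol: counts` row per nucleotide) can be produced with a small formatting helper. This is only a sketch: `format_result` is a hypothetical name, and the profile is assumed to be a dictionary mapping each symbol to its list of counts.

```python
def format_result(consensus, profile):
    """Format a consensus string and profile in the example output layout."""
    lines = [consensus]
    for symbol in "ACGT":
        # Join the counts of one profile row with single spaces
        counts = " ".join(str(n) for n in profile[symbol])
        lines.append(f"{symbol}: {counts}")
    return "\n".join(lines)

profile = {
    "A": [5, 1, 0, 0, 5, 5, 0, 0],
    "C": [0, 0, 1, 4, 2, 0, 6, 1],
    "G": [1, 1, 6, 3, 0, 1, 0, 0],
    "T": [1, 5, 0, 0, 0, 1, 1, 6],
}
print(format_result("ATGCAACT", profile))
```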

def consensus(s):
    
    # Read the FASTA file into a list of sequences, one string per record.
    # A record's sequence may be wrapped over several lines, so every line
    # after a '>Rosalind_...' header is joined onto the current record.
    data = []
    with open(s, "r") as file:
        for line in file:
            line = line.strip()
            if not line:
                continue
            if line[0] == '>':
                data.append('')
            elif data:
                data[-1] = data[-1] + line
    
    # Generate consensus and profile lists
    a_prof = []
    c_prof = []
    g_prof = []
    t_prof = []
    cons = []
    # Count number of A, C, G, or T for profile
    for i in range(0,len(data[0])):
        numA = 0
        numC = 0
        numG = 0
        numT = 0
        for j in range(0,len(data)):
            if data[j][i] == 'A':
                numA = numA + 1
            elif data[j][i] == 'C':
                numC = numC + 1
            elif data[j][i] == 'G':
                numG = numG + 1
            elif data[j][i] == 'T':
                numT = numT + 1
            else:
                print('This is not actually DNA')
                return
        # Add counts of A, C, G, or T to appropriate profile
        a_prof.append(numA)
        c_prof.append(numC)
        g_prof.append(numG)
        t_prof.append(numT)
        # Compare counts and add a nucleotide with the greatest count to consensus.
        # Using >= breaks ties in the order A, C, G, T, so a tied column no
        # longer falls through to 'T' when 'T' is not among the most common.
        if numA>=numC and numA>=numG and numA>=numT:
            cons.append('A')
        elif numC>=numG and numC>=numT:
            cons.append('C')
        elif numG>=numT:
            cons.append('G')
        else:
            cons.append('T')
    # Formatting for output
    DNAcons = ''
    DNA_a = 'A:'
    DNA_c = 'C:'
    DNA_g = 'G:'
    DNA_t = 'T:'
    for i in range(0, len(cons)):
        DNAcons=DNAcons+cons[i]
        DNA_a = DNA_a +' '+str(a_prof[i])
        DNA_c = DNA_c +' '+str(c_prof[i])
        DNA_g = DNA_g +' '+str(g_prof[i])
        DNA_t = DNA_t +' '+str(t_prof[i])
    print(DNAcons)
    print(DNA_a)
    print(DNA_c)
    print(DNA_g)
    print(DNA_t)

#Test
consensus('./Sample Dataset.txt')


ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6
