**Consensus and Profile**

A matrix is a rectangular table of values divided into rows and columns. An m×n matrix has m rows and n columns. Given a matrix A, we write Ai,j to indicate the value found at the intersection of row i and column j.

Say that we have a collection of DNA strings, all having the same length n. Their profile matrix is a 4×n matrix P in which P1,j represents the number of times that 'A' occurs in the jth position of one of the strings, P2,j represents the number of times that 'C' occurs in the jth position, and so on (see below).
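
As a small illustration of the definition above, the sketch below counts symbols per position for three made-up strings. The strings and the helper name `profile_matrix` are hypothetical and not part of the Rosalind dataset; the rows are assumed to be ordered A, C, G, T.

```python
# Toy input: three hypothetical DNA strings of equal length (not from the dataset).
strings = ["ATGC", "ATGG", "CTGA"]

def profile_matrix(dna_strings):
    """Return a 4 x n list of lists; rows are ordered A, C, G, T."""
    n = len(dna_strings[0])
    if any(len(s) != n for s in dna_strings):
        raise ValueError("all strings must have the same length")
    # Row for each base: how many strings carry that base at position j.
    return [[sum(s[j] == base for s in dna_strings) for j in range(n)]
            for base in "ACGT"]

profile = profile_matrix(strings)
for base, row in zip("ACGT", profile):
    print(f"{base}:", *row)
# A: 2 0 0 1
# C: 1 0 0 1
# G: 0 0 3 1
# T: 0 3 0 0
```

Each column sums to the number of strings (3 here), which is a quick sanity check on any profile.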

A consensus string c is a string of length n formed from our collection by taking the most common symbol at each position; the jth symbol of c therefore corresponds to the symbol having the maximum value in the j-th column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.
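
To make the point about ties concrete, here is a minimal sketch that lists every possible consensus string for a small profile. The profile values and the function names are hypothetical, chosen only for illustration; rows are assumed to be ordered A, C, G, T.

```python
from itertools import product

# Hypothetical 4 x n profile; rows are A, C, G, T and columns are positions.
profile = [
    [2, 0, 0, 1],  # A
    [1, 0, 0, 1],  # C
    [0, 0, 3, 1],  # G
    [0, 3, 0, 0],  # T
]

def consensus_candidates(profile, alphabet="ACGT"):
    """For each position, list every symbol that reaches the column maximum."""
    per_position = []
    for column in zip(*profile):  # iterate over columns of the profile
        best = max(column)
        per_position.append([b for b, count in zip(alphabet, column) if count == best])
    return per_position

def all_consensus_strings(profile):
    """Combine the per-position candidates into every possible consensus string."""
    return ["".join(choice) for choice in product(*consensus_candidates(profile))]

print(all_consensus_strings(profile))
# ['ATGA', 'ATGC', 'ATGG']
```

The last position is a three-way tie between A, C and G, so there are three valid consensus strings. Note that the number of combinations grows multiplicatively with the number of tied positions, so for long inputs it is usually enough to report a single one.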

In [2]:
import os
import pandas as pd

In [38]:

filepath = "/mnt/c/Data/ROSALIND_download/rosalind_cons.txt"

# Parse the FASTA file: each '>' header line starts a new record, and the
# sequence lines that follow it are concatenated into a single string.
sequences = []
with open(filepath, 'r') as f:
    for line in f:
        line = line.strip()  # drops '\n' and any '\r' from Windows line endings
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            sequences.append("")
            continue
        sequences[-1] += line

print(len(sequences), sequences[-1])


10 AGGCCGTAGATTAGGTTGTGTAGAGACACATTGCCGTGGATCAGGTCTTCACAGCCTACGTAGATCAACCATTTGGTTGAACATGCTACTCTAAATCGAGGCAAAATGACCGAATAGAAGCAAGGCGCGTCCGAGCTTTGTGTCACCCTAGGTTGACCCTAAGCTCACAATCCCGCTGGATGCTCCTAACCATGTCGTACCGCATTCTCGATAAGACTTCATAAGACGAAGTATGTAGTCGGAAACAGGGAGCTGCTAGGAAGGCCGGCTGGCAAAGTACATGGGTAGCATCTATACTCTCAAACTTGCATGACGAGGATCGTGTCTAGTAGGCTCCGTTCCCGACATCTCGAATGTAGACTAGGCAGGACTCGATCACTGCTAATGCATTGTTCCTGTGGAATTGCATGTCACTCTGATAGCAGCGACACCTATGCACAGTGAGTGTGCAAATCGACTTCGTGAGCGTTATATACCCCGACGGGCTGACCCCTCTTGACCGGTTTAGTACGTTAAAAATCCAGCCAGTCGACAACCGCTAACTTAATTCTTCCTGAGGCCGTGCAAGTGTGCATGTCTTAATTGAATGACAGAAGAGTTCTTCAAATACTTTCATGTAGGTACTAACATACCAGTACCTGCCAGCTAGACTAGAGTTATAATTTCTCGCGGTAGATGCGGGATGTATCACCCCGGCCATGCACCTTCATAGCAGGGGGAACTTGAACGGTACTATTACACTTCAACTGTCTGCGCCGAAACTGGACAAACGTTTCGGCGGATGCAAACGGAAACGGCATCAATCACCAAAGTATACGAAGCTAACTAGAGTGGACTGTCGAGCGGGATGGTACTTAGTTCAGTAATTGTAATAAGCGCGGAGTGTGGGTGATCCCCACATTTGGCGTTATTCTAGTACGCGCCGATTTTG


In [39]:
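# Split every sequence into single characters: one row per sequence,
# one column per position (avoids shadowing the built-in chr).
seq_list = [list(seq) for seq in sequences]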
seq_df = pd.DataFrame(seq_list)
seq_df

   0  1  2  3  4  5  6  7  8  9  ...  921  922  923  924  925  926  927  928  929  930
0  T  T  A  T  C  T  T  C  G  A  ...    A    G    C    G    G    A    A    G    A    A
1  T  G  C  C  C  C  G  G  G  A  ...    A    T    T    T    G    G    C    T    T    C
2  G  A  T  T  T  C  T  C  T  A  ...    C    T    T    G    C    C    C    C    T    A
3  T  T  T  T  C  C  C  A  C  C  ...    C    G    G    A    C    C    T    G    C    G
4  A  C  T  A  C  T  A  A  G  A  ...    C    G    C    A    A    C    A    T    A    G
5  C  T  A  C  C  T  G  A  T  C  ...    T    C    A    C    T    C    T    G    A    C
6  G  T  A  G  G  C  T  A  T  G  ...    T    A    T    T    T    T    T    G    A    T
7  T  A  C  T  G  G  T  G  T  C  ...    C    C    C    A    T    G    T    C    A    A
8  T  T  T  T  C  A  T  T  G  T  ...    A    C    G    A    A    T    G    T    C    C
9  A  G  G  C  C  G  T  A  G  A  ...    G    C    C    G    A    T    T    T    T    G


In [40]:
# Build the profile one position at a time. Despite the variable name,
# each entry is stored in A, C, G, T order.
ATGC_list = []
for col in seq_df.columns:
    col_list = list(seq_df[col])
    A, C, G, T = (col_list.count(base) for base in "ACGT")
    ATGC_list.append([A, C, G, T])

print(ATGC_list)

[[2, 1, 2, 5], [2, 1, 2, 5], [3, 2, 1, 4], [1, 3, 1, 5], [0, 7, 2, 1], [1, 4, 2, 3], [1, 1, 2, 6], [5, 2, 2, 1], [0, 1, 5, 4], [5, 3, 1, 1], [3, 2, 3, 2], [3, 3, 2, 2], [2, 3, 2, 3], [3, 3, 4, 0], [1, 3, 3, 3], [4, 0, 2, 4], [2, 1, 3, 4], [3, 2, 4, 1], [1, 4, 3, 2], [3, 3, 2, 2], [2, 1, 3, 4], [7, 2, 0, 1], [3, 4, 1, 2], [3, 2, 4, 1], [4, 2, 3, 1], [4, 1, 2, 3], [3, 4, 2, 1], [1, 2, 2, 5], [2, 2, 2, 4], [3, 2, 2, 3], [2, 4, 2, 2], [4, 1, 0, 5], [3, 4, 2, 1], [4, 3, 1, 2], [1, 5, 3, 1], [1, 3, 3, 3], [2, 1, 3, 4], [5, 2, 3, 0], [2, 3, 4, 1], [4, 3, 0, 3], [2, 3, 3, 2], [5, 2, 1, 2], [7, 1, 0, 2], [4, 2, 3, 1], [3, 3, 2, 2], [3, 2, 1, 4], [2, 3, 5, 0], [2, 0, 3, 5], [2, 3, 2, 3], [1, 5, 1, 3], [1, 5, 3, 1], [4, 3, 3, 0], [4, 4, 0, 2], [0, 4, 3, 3], [3, 4, 2, 1], [3, 4, 1, 2], [2, 1, 1, 6], [4, 0, 4, 2], [1, 2, 5, 2], [0, 2, 4, 4], [2, 1, 3, 4], [3, 3, 2, 2], [1, 2, 3, 4], [3, 2, 5, 0], [3, 3, 2, 2], [3, 3, 1, 3], [2, 2, 2, 4], [2, 5, 3, 0], [3, 2, 3, 2], [5, 1, 3, 1], [3, 2, 3, 2], [2, 2

In [41]:
ACGT_df = pd.DataFrame(ATGC_list, columns=["A","C","G","T"])
print(ACGT_df)

     A  C  G  T
0    2  1  2  5
1    2  1  2  5
2    3  2  1  4
3    1  3  1  5
4    0  7  2  1
..  .. .. .. ..
926  1  4  2  3
927  2  2  1  5
928  0  2  4  4
929  5  2  0  3
930  3  3  3  1

[931 rows x 4 columns]


In [None]:
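consensus = []
for line in ATGC_list:
    ## My first attempt. I didn't think of max() for finding the largest
    ## count, so it ended up more complicated than necessary.
    biggest = sorted(line, reverse=True)[0]
    seq = "A" if line.index(biggest) == 0 else "C" if line.index(biggest) == 1 else "G" if line.index(biggest) == 2 else "T"
    ## Simpler version written with max(). On a tie, index() returns the
    ## first maximum, so the earliest symbol in A, C, G, T order wins.
    seq2 = ["A","C","G","T"][line.index(max(line))]
    assert seq == seq2  # check both versions agree at every position
    consensus.append(seq)
print(''.join(consensus))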

TTTTCCTAGAAACGCATGCATACGAACTTACTCACCTAGACAAAATGTCCCAACCCTAGGTATGAATCAAATTTGCCCAACAATGTTAACATTACTTACGTCATACTGACAGCAGAGTACAAGTACAACTCCAACCATAGACCCATCGCACCTCGAAGTGAACCAGATAATTAAAGTGCAACCGTAAACAACGTGGCCAGATGGCATGACAAAAAACTGCTCATGTCAACGGAAGAAAACCGATGAATAAAACTGCCTCACAATCAGCCATTAACAAAACTTCGGAGAATGGGGGTGCCTCAATATTTCAGAAAACGGTACAATAGCTTATGCCACCGGTCCGAAAAAATCTAAGGCAAGCTTCCTATACAGGTAAGATGATAAATGGAGCCCACCCGCTGACGGCCATGCAACCAAGACGCAAACATCTGCTGGAGGCAGGGATTTTGACCGTCAGGAAAGAGCGGAAAGACCACCGAGACGTTCGAGACGCTCTTGAAGGGTAGAGGATATGACACTCCTCCAGGGACGATTAAGCAGCACTCACCCCTAGCAAAGGCCTCGCAGGGAACGACAGCTAAGCAGCTTCCCAAAATAGAGTGCCAAAATCGTTCCTTGTGAAACCATCATCATAGAATCTGCCTGGAGTACTGCATTCAAAAGTACTCAAGCGGCCCACAAGAAGTCAGTGGCCCAGTTAGTGCATGCAGATAATGCGGTATCTCACCACTAAAACGTTTCACACTCACATTGACACGGAATTTGAACTAAACGTAGCATTAAACTAAGGAAGGCGAGAACCTTAAGGAGCAGAAAGCGACCTGCCCAGACAGCGAGTGTCGGAGTGATGGAGGATATGACACTACCAATTCCGAAGGCTCAGTGACCCGAAATCGGAAACCGCTATCCCAGCCCCACCTGCCCAACTGAA
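
The counting and maximum-finding steps above could also be expressed with NumPy array operations. The following is only a sketch on a made-up three-string input (the strings are hypothetical, not from the dataset), assuming NumPy is installed; `argmax` returns the first maximum, so ties resolve in A, C, G, T order just as `list.index(max(...))` does.

```python
import numpy as np

# Hypothetical toy input; in the notebook this would be the parsed sequences.
seqs = ["ATGC", "ATGG", "CTGA"]

# One row per sequence, one column per position (array of single characters).
arr = np.array([list(s) for s in seqs])

bases = np.array(list("ACGT"))

# Profile: for each base, count matches down each column -> shape (4, n).
profile = np.stack([(arr == b).sum(axis=0) for b in bases])

# Consensus: index of the largest count in every column, mapped back to a base.
consensus = "".join(bases[profile.argmax(axis=0)])

print(profile.tolist())
print(consensus)
```

For ten sequences of roughly a thousand bases the loop-based version is perfectly adequate; the array form mainly keeps the profile in the 4×n orientation used in the problem statement.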


In [43]:
for col in ACGT_df.columns:
    print(f"{col}:", ' '.join(map(str,ACGT_df[col])))

A: 2 2 3 1 0 1 1 5 0 5 3 3 2 3 1 4 2 3 1 3 2 7 3 3 4 4 3 1 2 3 2 4 3 4 1 1 2 5 2 4 2 5 7 4 3 3 2 2 2 1 1 4 4 0 3 3 2 4 1 0 2 3 1 3 3 3 2 2 3 5 3 2 3 0 2 0 2 1 3 3 3 4 4 2 2 1 1 4 4 0 3 3 2 4 2 2 2 4 2 1 2 3 5 3 4 2 2 1 4 2 4 2 2 3 3 4 1 3 3 1 3 5 3 3 4 2 4 4 1 1 1 3 4 6 2 3 5 2 4 1 3 2 1 3 3 1 0 2 2 3 2 2 2 2 2 4 4 2 2 0 4 3 3 1 3 1 3 1 3 4 1 3 4 4 4 1 2 1 0 3 5 3 2 0 2 4 3 3 2 4 3 2 2 1 3 2 2 2 4 2 5 2 2 1 2 3 1 2 3 2 4 4 4 4 3 4 2 1 2 2 3 1 3 3 2 3 1 3 4 1 1 2 3 3 3 5 4 4 3 2 1 2 4 2 3 4 4 0 4 5 4 5 2 1 1 2 1 1 3 4 2 5 3 3 1 4 2 3 2 4 0 1 5 6 1 4 4 4 5 2 2 1 3 1 2 4 1 4 4 2 2 3 2 3 2 3 1 3 2 3 2 4 5 3 3 3 3 2 0 4 2 4 3 3 3 3 1 1 3 4 1 4 4 1 4 3 3 3 1 3 2 2 1 1 3 2 0 2 2 3 1 2 2 4 4 5 3 4 5 1 1 3 6 4 1 2 0 3 5 4 2 2 1 3 1 3 3 2 4 2 4 0 2 1 7 3 2 4 3 2 4 1 4 4 3 2 2 2 3 1 1 2 2 3 2 1 2 2 2 3 2 4 2 1 1 2 2 4 2 1 2 4 4 1 2 4 4 3 6 2 2 3 4 5 3 1 4 1 0 1 1 2 2 3 2 4 3 1 3 4 2 2 2 4 0 3 3 0 2 4 2 3 2 0 3 5 2 1 4 3 4 1 3 0 3 3 2 3 7 3 2 4 2 1 4 0 2 3 4 2 4 2 2 0 1 2 1 3 1 5 2 1 2 0 2 1 2 2 5