Problem
A matrix is a rectangular table of values divided into rows and columns. An m×n matrix has m rows and n columns. Given a matrix A, we write A_{i,j} to indicate the value found at the intersection of row i and column j.

Say that we have a collection of DNA strings, all having the same length n. Their profile matrix is a 4×n matrix P in which P_{1,j} represents the number of times that 'A' occurs in the jth position of one of the strings, P_{2,j} represents the number of times that 'C' occurs in the jth position, and so on (see below).
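
As a rough illustration of the definition above, the profile matrix can be built by counting symbols column by column. This is only a sketch: the function name `profile_matrix`, the `alphabet` parameter, and the three short example strings are hypothetical and not part of the problem statement.

```python
from collections import Counter

def profile_matrix(strings, alphabet="ACGT"):
    # zip(*strings) yields one tuple per position (column) across all strings
    column_counts = [Counter(column) for column in zip(*strings)]
    # One row per symbol (A, C, G, T), one entry per position
    return [[counts[symbol] for counts in column_counts] for symbol in alphabet]

# Hypothetical toy input: three strings of length 3
example = ["ATC", "AGC", "TTC"]
profile = profile_matrix(example)
for symbol, row in zip("ACGT", profile):
    print(symbol, row)
```

Row 1 holds the counts for 'A', row 2 for 'C', and so on, so `profile[0][j]` plays the role of P_{1,j+1} because Python indexes from zero.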

A consensus string c is a string of length n formed from our collection by taking the most common symbol at each position; the jth symbol of c therefore corresponds to the symbol having the maximum value in the jth column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.
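
To make the point about ties concrete, here is a small sketch that lists the most common symbols at each position and counts how many consensus strings are possible. The helper name `consensus_choices` and the two-column profile are made up for illustration; the profile corresponds to the hypothetical strings "AG" and "CG".

```python
from math import prod

def consensus_choices(profile, alphabet="ACGT"):
    # For every column, collect all symbols that reach the column maximum
    choices = []
    for column in zip(*profile):
        best = max(column)
        choices.append([s for s, count in zip(alphabet, column) if count == best])
    return choices

# Hypothetical profile (rows A, C, G, T) for the strings "AG" and "CG"
profile = [[1, 0],
           [1, 0],
           [0, 2],
           [0, 0]]
choices = consensus_choices(profile)
one_consensus = ''.join(c[0] for c in choices)   # pick the first candidate each time
n_consensus = prod(len(c) for c in choices)      # number of distinct consensus strings
print(choices, one_consensus, n_consensus)
```

Position 1 is tied between 'A' and 'C', so both "AG" and "CG" are valid consensus strings; picking the first candidate at each position is one arbitrary but acceptable choice.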

Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)

In [1]:
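def read_fasta(fasta):
    """Parse FASTA-formatted text into a {header: sequence} dict."""
    sequences = {}
    header = None
    seq = []
    for line in fasta.splitlines():
        line = line.strip()
        if not line:
            continue  # Skip blank lines
        if line.startswith('>'):
            # A new record starts: store the previous one first
            if header is not None:
                sequences[header] = ''.join(seq)
            header = line[1:]  # Remove '>'
            seq = []
        else:
            seq.append(line)  # A sequence may span several lines
    if header is not None:
        sequences[header] = ''.join(seq)  # Add the last sequence
    return sequences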

In [2]:
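def create_profile(seq_dict):
    """Return (consensus string, 4-by-n profile matrix) using plain lists."""
    matrix_dict = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    sequences = list(seq_dict.values())
    # Build an n-by-4 count table first: one row per position, one column per base
    result_matrix = [[0] * 4 for _ in sequences[0]]
    for seq in sequences:
        for i, base in enumerate(seq):
            result_matrix[i][matrix_dict[base]] += 1
    # Transpose to the 4-by-n layout (rows A, C, G, T)
    t_result_matrix = list(map(list, zip(*result_matrix)))
    # index(max(...)) returns the first maximum, so ties resolve in A, C, G, T order
    bases = list(matrix_dict)
    result_seq = ''.join(bases[counts.index(max(counts))] for counts in result_matrix)
    return result_seq, t_result_matrix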

In [3]:
import numpy as np

# NumPy variant; running this cell redefines create_profile
def create_profile(seq_dict):
    matrix_dict = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    sequences = list(seq_dict.values())
    # n-by-4 integer count table: one row per position, one column per base
    result_matrix = np.zeros((len(sequences[0]), 4), dtype=int)
    for seq in sequences:
        for i, base in enumerate(seq):
            result_matrix[i, matrix_dict[base]] += 1

    # argmax along each row gives the index of the most common base (first one on ties)
    max_index = result_matrix.argmax(axis=1)
    bases = list(matrix_dict)
    result_seq = ''.join(bases[i] for i in max_index)
    return result_seq, result_matrix.T

In [4]:
sample_fasta = """>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT"""

In [5]:
create_profile(read_fasta(sample_fasta))

('ATGCAACT',
 array([[5, 1, 0, 0, 5, 5, 0, 0],
        [0, 0, 1, 4, 2, 0, 6, 1],
        [1, 1, 6, 3, 0, 1, 0, 0],
        [1, 5, 0, 0, 0, 1, 1, 6]]))
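
The tuple above is convenient in a notebook but usually needs flattening into text before it can be submitted. The sketch below assumes the answer is expected as the consensus string on the first line followed by one labelled row per base (`A: ...`, `C: ...`, and so on); that layout is an assumption rather than something stated in the problem text above, and `format_answer` is a hypothetical helper name. The values are copied from the output shown above.

```python
def format_answer(consensus, profile, alphabet="ACGT"):
    # First line: consensus string; then one "X: c1 c2 ..." line per base
    lines = [consensus]
    for symbol, row in zip(alphabet, profile):
        # int() also accepts NumPy integers, so list or array rows both work
        lines.append(f"{symbol}: " + " ".join(str(int(count)) for count in row))
    return "\n".join(lines)

# Values taken from the sample run above
consensus = "ATGCAACT"
profile = [[5, 1, 0, 0, 5, 5, 0, 0],
           [0, 0, 1, 4, 2, 0, 6, 1],
           [1, 1, 6, 3, 0, 1, 0, 0],
           [1, 5, 0, 0, 0, 1, 1, 6]]
report = format_answer(consensus, profile)
print(report)
```

Because the helper only iterates over rows, it should accept either the nested lists from the pure-Python cell or the transposed array from the NumPy cell.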