# Consensus and Profile
**Given:** A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

**Return:** A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)
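
As a quick illustration of the two quantities, here is a minimal sketch that builds the profile matrix and one valid consensus string for the sample strings used in this notebook. It relies on `collections.Counter` rather than the class-based solution further down, and the variable names are chosen for illustration only. Ties are broken in A, C, G, T order, which is just one of the choices the problem allows.

```python
from collections import Counter

# The seven sample strings from this notebook; equal length is assumed.
sequences = [
    "ATCCAGCT", "GGGCAACT", "ATGGATCT", "AAGCAACC",
    "TTGGAACT", "ATGCCATT", "ATGGCACT",
]

profile = {base: [] for base in "ACGT"}
consensus = ""

# zip(*sequences) yields one tuple of bases per column (position).
for column in zip(*sequences):
    counts = Counter(column)
    for base in "ACGT":
        profile[base].append(counts[base])
    # max() returns the first base in A, C, G, T order with the top count.
    consensus += max("ACGT", key=lambda b: counts[b])

print(consensus)
for base in "ACGT":
    print(base + ":", *profile[base])
```

Each column of the profile sums to the number of input strings, which is a cheap sanity check on any implementation.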


# Sample Dataset

In [1]:
%%file Sample_Dataset.txt
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT


Overwriting Sample_Dataset.txt


# Sample Output

In [2]:
%%file Sample_Output.txt
ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6



Overwriting Sample_Output.txt


# Solution

In [3]:
class FastaRecord:
    def __init__(self, name, sequence=""):
        self.name = name
        self.sequence = sequence

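def parseFastaFile(fasta_file_path):
    "Parse a FASTA file into a list of FastaRecord objects."
    fasta_records = []

    # The context manager closes the file even if parsing raises.
    with open(fasta_file_path, 'r') as fasta_file:
        for line in fasta_file:
            line = line.rstrip()
            if not line:
                continue  # skip blank lines; indexing line[0] would fail on them
            if line[0] == ">":
                # Header line: start a new record named after the text following ">"
                fasta_records.append(FastaRecord(line[1:]))
            else:
                # Sequence line: append to the most recent record
                fasta_records[-1].sequence += line

    return fasta_records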


In [4]:
class ConsensusAndProfile:

    def __init__(self, fasta_records):
        """Given a collection of at most 10 DNA strings of equal length
        (at most 1 kbp) in FASTA format, compute a consensus string and
        profile matrix for the collection. (If several possible consensus
        strings exist, any one of them may be returned.)"""
        
        self.consensus = ""
        self.profile = {'A':[], 'C':[], 'G':[], 'T':[]}
        
        sequences = [fasta_record.sequence for fasta_record in fasta_records]
        list_of_zipped_sequences = list(zip(*sequences))
        transposed_sequences = ["".join(position) for position in list_of_zipped_sequences]
        
        for base_position in transposed_sequences:
            base_consensus = ""
            max_base_count = 0
            # Bases are visited in A, C, G, T order; the strict ">" means
            # ties go to the first base that reached the maximum count.
            for base in self.profile:
                base_count = base_position.count(base)
                self.profile[base].append(base_count)
                if base_count > max_base_count:
                    max_base_count = base_count
                    base_consensus = base
            self.consensus += base_consensus
    
    def __str__(self):
        # First line is the consensus, followed by one "X: counts" line per base.
        summary_lines = [self.consensus]
        for base in "ACGT":
            counts = " ".join(map(str, self.profile[base]))
            summary_lines.append("%s: %s" % (base, counts))

        # Keep the trailing newline so the output matches the expected file.
        return "\n".join(summary_lines) + "\n"

    def __repr__(self):
        return str(self)
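
The constructor above takes the problem's guarantee of equal-length strings on trust. Because `zip()` stops at its shortest argument, strings of unequal length would be truncated silently rather than rejected. A minimal sketch of a guard is shown below; `check_equal_length` is a hypothetical helper and not part of the original solution.

```python
def check_equal_length(sequences):
    """Raise ValueError if the sequences are not all the same length.

    Hypothetical helper, not part of the original solution.
    """
    lengths = {len(sequence) for sequence in sequences}
    if len(lengths) > 1:
        raise ValueError("sequences differ in length: %s" % sorted(lengths))


# zip() stops at the shortest input, so the trailing "T" is dropped silently.
columns = list(zip("ACGT", "ACG"))
print(len(columns))  # 3

check_equal_length(["ACGT", "ACGA"])  # equal lengths: passes without error
```

Calling such a check on `sequences` before transposing would turn a silent truncation into an explicit error.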


In [5]:
def consensusAndProfileFromFileToFile(input_file_path, output_file_path):
    "Wraps ConsensusAndProfile to read from input_file_path and write to output_file_path"
    
    fasta_records = parseFastaFile(input_file_path)
    consensus_and_profile = ConsensusAndProfile(fasta_records)
    
    # The context manager flushes and closes the file even if writing raises.
    with open(output_file_path, 'w') as output_file:
        output_file.write(str(consensus_and_profile))


# Test Solution

In [6]:
consensusAndProfileFromFileToFile("Sample_Dataset.txt", "Test_Output.txt")

In [7]:
%%bash
echo Sample_Output.txt
md5sum Sample_Output.txt
cat Sample_Output.txt

Sample_Output.txt
7ce2d1315cb7ec785ee52b68fac40bea  Sample_Output.txt
ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6


In [8]:
%%bash
echo Test_Output.txt
md5sum Test_Output.txt
cat Test_Output.txt

Test_Output.txt
7ce2d1315cb7ec785ee52b68fac40bea  Test_Output.txt
ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6


In [9]:
%%bash
if [ "$(md5sum Sample_Output.txt|cut -f1 -d' ')" == "$(md5sum Test_Output.txt|cut -f1 -d' ')" ]
then
    echo Sample output matches test output.
else
    echo Sample output does not match test output.
fi

Sample output matches test output.
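
The same comparison can be done without leaving Python. The sketch below hashes two files with `hashlib.md5`, mirroring the `md5sum` check above; `md5_of_file` and `files_match` are hypothetical helper names, and the demo writes its own temporary files so that it runs on its own.

```python
import hashlib
import os
import tempfile


def md5_of_file(path):
    """Return the hex MD5 digest of a file's bytes (hypothetical helper)."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read()).hexdigest()


def files_match(path_a, path_b):
    """True if both files have the same MD5 digest (hypothetical helper)."""
    return md5_of_file(path_a) == md5_of_file(path_b)


# Demo on two temporary files with identical content.
with tempfile.TemporaryDirectory() as tmp_dir:
    expected = os.path.join(tmp_dir, "expected.txt")
    actual = os.path.join(tmp_dir, "actual.txt")
    for path in (expected, actual):
        with open(path, "w") as handle:
            handle.write("ATGCAACT\n")
    same = files_match(expected, actual)

print(same)
```

In the notebook this would be called as `files_match("Sample_Output.txt", "Test_Output.txt")`, assuming both files exist in the working directory.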


# Downloaded Dataset

In [10]:
%%bash
cp ~/Downloads/rosalind_cons.txt ./
cat rosalind_cons.txt

GGCAGTATGCAGTATGGCAGTATGCAGTATGCAGTATGACAGTATGAGGTTTCAGTATGCCAGTATGTACAGTATGCGTCAGTATGTCAGTATGTGGCAGTATGCAGTATGCAGTATGCAGTATGGAACAGTATGCAGTATGCAGTATGTCAGTATGAGTTCAGTATGATCAGTATGCCTTGGACCCAGTATGCAGTATGATACAGTATGCAGTATGTCAGTATGATTCAGTATGTCAGTATGCCAGTATGCAGTATGATAACACAGTATGCAACCAGTATGCAGTATGCACAGTATGCCAGTATGCAGTATGGAGACATCGTCAGTATGCAGTATGCAGTATGCGCAGTATGCAGTATGAGCACAGTATGAGCTCACGCAGTATGCAGTATGTAACAGTATGCAGTATGTTCCAGTATGATTTTGTCAGTATGGTGCAGTATGCAGTATGGAAAGCAGTATGGATCAGTATGGTCAGTATGCAGTATGTACAGTATGCAGTATGATGCGCAGTATGCCGTCCAGTATGCGCAGTATGCAGTATGGAGATCAGTATGTCCAACCAGTATGTGGCAGTATGCAGTATGCAGTATGTACCCCAGTATGCAGTATGCAGTATGCAGTATGCAACAGTATGCCAGTATGACCAGTATGCAGTATGGTCGCCAGTATGAGCAGTATGTCGCAGTATGCAGTATGGGAAGCAGTATGCCAGTATGCTCAGTATGCCAGTATGGATCAGTATGTACCAGTATGATGTACAGTATGGGTGAACAGTATGCAGTATGTCAGTATGGCAGTATGTACCAGTATGTCATAACAGTATGGCAGTATGAGTGGGGCAGTATGCAGTATGCGGCAGTATGCCAGTATGCAGTATGCAGTATGCTGCAGTATGATCAGTATGACGGAACACACAGTATGACGGCAGTATGCAGTATGGCTGCAGTATGTCTTCAATTTAACAGTATGCAGTATGATCAGTATG
CAGTATGCA


# Solution to Downloaded Dataset

In [11]:
consensusAndProfileFromFileToFile("rosalind_cons.txt", "Solution_Output.txt")

FileNotFoundError: [Errno 2] No such file or directory: 'rosalind_cons.txt'

In [None]:
%%bash
cat Solution_Output.txt