# Computing GC Content
**Given:** At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

**Return:** The ID of the string having the highest GC-content, followed by the GC-content of that string. GC-content is the percentage of symbols in a string that are 'C' or 'G'. Rosalind allows for a default absolute error of 0.001 in all decimal answers unless otherwise stated.
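
As a minimal sketch of the quantity being asked for, the snippet below computes the GC-content of a single DNA string. The helper name `gc_content` is hypothetical and is not part of the solution that follows; it assumes the string contains only uppercase `A`, `C`, `G`, and `T`.

```python
def gc_content(sequence):
    """Return the percentage of symbols in `sequence` that are 'G' or 'C'."""
    if not sequence:
        # Guard added for illustration: the percentage is undefined for an empty string
        raise ValueError("GC-content is undefined for an empty string")
    return 100 * (sequence.count("G") + sequence.count("C")) / len(sequence)

# 3 of the 8 symbols are G or C
print(gc_content("AGCTATAG"))  # prints 37.5
```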


# Sample Dataset

In [1]:
%%file Sample_Dataset.txt
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT



Overwriting Sample_Dataset.txt


# Sample Output

In [2]:
%%file Sample_Output.txt
Rosalind_0808
60.919540



Overwriting Sample_Output.txt


# Solution

In [3]:
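import collections

def parseFastaFile(fasta_file_path):
    "Parse a FASTA file into an OrderedDict mapping each record ID to its full sequence."
    fasta_records = collections.OrderedDict()
    current_id = None

    # The with-block closes the file even if parsing raises an error
    with open(fasta_file_path, 'r') as fasta_file:
        for line in fasta_file:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # Header line: start a new record keyed by the text after ">"
                current_id = line[1:]
                fasta_records[current_id] = ""
            elif current_id is not None:
                # Sequence line: append to the most recent record
                fasta_records[current_id] += line

    return fasta_records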


In [4]:
fasta_records = parseFastaFile("Sample_Dataset.txt")
sequences = list(fasta_records.values())
names_of_sequences = list(fasta_records.keys())
# GC-content of each sequence, as a percentage
gc_content_of_sequences = [100 * (sequence.count("G") + sequence.count("C")) / len(sequence) for sequence in sequences]

# Pick the highest GC-content and look up the matching record ID
gc_content_of_sequence_with_max_gc_content = max(gc_content_of_sequences)
name_of_sequence_with_max_gc_content = names_of_sequences[gc_content_of_sequences.index(gc_content_of_sequence_with_max_gc_content)]


In [5]:
def getMaxGCContent(fasta_records):
    "Given FASTA records as a mapping of ID to DNA string, return a tuple (ID, GC-content in percent) for the string having the highest GC-content."
    sequences = list(fasta_records.values())
    names_of_sequences = list(fasta_records.keys())
    # GC-content of each sequence, as a percentage
    gc_content_of_sequences = [100 * (sequence.count("G") + sequence.count("C")) / len(sequence) for sequence in sequences]

    # index() returns the first maximum, so ties go to the earliest record
    gc_content_of_sequence_with_max_gc_content = max(gc_content_of_sequences)
    name_of_sequence_with_max_gc_content = names_of_sequences[gc_content_of_sequences.index(gc_content_of_sequence_with_max_gc_content)]

    return (name_of_sequence_with_max_gc_content, gc_content_of_sequence_with_max_gc_content)
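
The same selection can be written more compactly with `max` and a `key` function, which avoids building parallel lists and looking up an index. This is only a sketch of an alternative, not the notebook's own method; the names `gc_percent`, `max_gc_record`, and the toy `records` dictionary are hypothetical.

```python
def gc_percent(sequence):
    """Percentage of symbols in `sequence` that are 'G' or 'C'."""
    return 100 * (sequence.count("G") + sequence.count("C")) / len(sequence)

def max_gc_record(records):
    """Return (ID, GC-content) for the record with the highest GC-content."""
    # max() over a dict iterates its keys; ties go to the first key encountered
    best_id = max(records, key=lambda record_id: gc_percent(records[record_id]))
    return best_id, gc_percent(records[best_id])

# Toy input, assumed for illustration only
records = {"seq_a": "ATAT", "seq_b": "GGCA", "seq_c": "GCAT"}
print(max_gc_record(records))  # prints ('seq_b', 75.0)
```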


In [6]:
def getMaxGCContentFromFileToFile(input_file_path, output_file_path):
    "Wraps getMaxGCContent to read from input_file_path and write to output_file_path"

    output_data = getMaxGCContent(parseFastaFile(input_file_path))
    # One value per line: the record ID, then its GC-content
    output_string = "\n".join(map(str, output_data))

    # The with-block closes the output file even if the write fails
    with open(output_file_path, 'w') as output_file:
        output_file.write("%s\n" % output_string)


# Test Solution

In [7]:
getMaxGCContentFromFileToFile("Sample_Dataset.txt", "Test_Output.txt")

In [8]:
%%bash
echo Sample_Output.txt
md5sum Sample_Output.txt
cat Sample_Output.txt

Sample_Output.txt
e58f0a54bbaf8c2aaf8fa7ea1c32e792  Sample_Output.txt
Rosalind_0808
60.919540


In [9]:
%%bash
echo Test_Output.txt
md5sum Test_Output.txt
cat Test_Output.txt

Test_Output.txt
3b24a4c35f6869f2716c37784e7b3f95  Test_Output.txt
Rosalind_0808
60.91954022988506
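
The two checksums above differ only because the files print the GC-content to different precision, so a byte-for-byte comparison is stricter than the problem requires. A minimal sketch of a comparison that instead applies the allowed absolute error of 0.001 is shown below; the function name `answers_match` and its `tolerance` parameter are hypothetical, and it assumes each answer consists of an ID followed by one number.

```python
def answers_match(expected_text, actual_text, tolerance=0.001):
    """Compare two 'ID, value' answers, allowing an absolute error on the value."""
    expected_id, expected_value = expected_text.split()
    actual_id, actual_value = actual_text.split()
    return (expected_id == actual_id
            and abs(float(expected_value) - float(actual_value)) <= tolerance)

# Contents of Sample_Output.txt and Test_Output.txt as shown above
sample = "Rosalind_0808\n60.919540\n"
test = "Rosalind_0808\n60.91954022988506\n"
print(answers_match(sample, test))  # prints True
```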


# Downloaded Dataset

In [10]:
%%bash
cp ~/Downloads/rosalind_gc.txt ./
cat rosalind_gc.txt

>Rosalind_3350
CCACCGGCACTCCGTAGAGATAATTTAATGTATAGCCCTAATCCACCATCATGTCACGTA
GATTGGTACTCGCGTACGTAAACTGGGAGCACAGGGCGATCAGTCCATCCGAGCGTCAAC
TAACGGCTGCCATCCAATAGGCAGCAGAGTCCCTTAGACCCTCCCGTAGCAGTCATTTTG
TTTCCACGTAATTTTACTACGAACACGAGGCTCTGCCTTTCGCTTGCTTGAACCAATTCC
AGCAATGGCTGTCGCTGTTCTTTGGCGGTTGCAAAATTCGCGCCATTCTTGAGTAGAACG
TCTATAGGGGAAGAGGAGCCTTAAATTCCTGGTATTGTTTTAAGGGGTGGTCTCTTCAAA
GAAGACCTTCTAGCTTTCAATACAGGGGGCGGTCCTGGTTACCCGCAGTTATCAGTCGGC
TGTAAGCGAGCTGTACTCACAAAATTTACGTGAAGCCGCCCTTTCAAGTCGGCTTCATAC
TTCGAATATGAGCCAATATCTTGCAGCAGACGGTACTTCACGAGGCCTACTCTGCTTGCT
CTGGATACCCCGCGGTTTAAGGGTTCTGTGGCCAGCATCAGAGGCGTGAAGCGACCTCTA
GGGTGCTACGTATAGCTAAGTGGGCGCTTCACGTAGAAACCGTGTTCTCCGCGTCATCTA
GATTGAGCTCTGACGTGGAAGCGCGAAACTTTGCTACGCAGGCACCAGCCGTCTGTGTAT
GAACAGCTATCGCCCGGAGACTACAGCCTATACCCAGGCTCTTCTTGCCGAAGGAATGCC
GCCCTGACGGCTTCAGAGCCCAAAGGGTCGCGTAGATATATATGTTT
>Rosalind_7302
GACCGCCGTCCCACGAGTTCGTCGCCCAGGGCCTCTACGAGAGGGAGTGGAAGCGTTAGA
TCCCCACGTATTGGGCCGTCTGCCTTGCGTGGTGAATTGAAGGGGCCTATATTCCGTTAC
TACAGCT

# Solution to Downloaded Dataset

In [11]:
getMaxGCContentFromFileToFile("rosalind_gc.txt", "Solution_Output.txt")

In [12]:
%%bash
cat Solution_Output.txt

Rosalind_4067
53.36225596529284
