# Problem definition

The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
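
As a quick check of the example above, here is a minimal sketch that computes the GC-content of "AGCTATAG" and of its reverse complement. The helper names `gc_percent` and `reverse_complement` are illustrative only and are not part of the notebook's own code.

```python
def gc_percent(dna):
    """Return the percentage of symbols in `dna` that are 'G' or 'C'."""
    dna = dna.upper()
    return 100.0 * (dna.count("G") + dna.count("C")) / len(dna)


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string over A, C, G, T."""
    # Complement each base, then reverse the result.
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


example = "AGCTATAG"
print(gc_percent(example))                      # 37.5
print(gc_percent(reverse_complement(example)))  # 37.5
```

Complementing swaps G with C and A with T, so the number of G/C symbols is unchanged, and reversing does not change counts either.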

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the next line that begins with '>' marks the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.
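
The format described above can be sketched with a small parser that works on a string rather than a file. The function name `parse_fasta` and the two records in `sample` are made up for illustration; they are not Rosalind data.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping label -> sequence."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A '>' line starts a new record; the rest of the line is its label.
            label = line[1:]
            records[label] = ""
        elif label is not None:
            # Sequence data may wrap over several lines, so concatenate them.
            records[label] += line
    return records


# Hypothetical records, for illustration only.
sample = """>Rosalind_0001
ACGT
ACG
>Rosalind_0002
GGCC
"""
print(parse_fasta(sample))
# {'Rosalind_0001': 'ACGTACG', 'Rosalind_0002': 'GGCC'}
```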

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.
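
One way to read the 0.001 tolerance is as an absolute-difference check between a submitted value and the expected one. The sketch below assumes that interpretation; `within_tolerance` is a hypothetical helper, not Rosalind's actual grader.

```python
import math


def within_tolerance(answer, expected, abs_tol=0.001):
    """Return True if `answer` differs from `expected` by at most `abs_tol`."""
    # rel_tol=0.0 disables the relative check so only the absolute error counts.
    return math.isclose(answer, expected, rel_tol=0.0, abs_tol=abs_tol)


print(within_tolerance(37.5004, 37.5))  # True
print(within_tolerance(37.51, 37.5))    # False
```

Under this reading, a GC-content printed with a few extra or missing decimal places would still be accepted as long as it stays within 0.001 of the true value.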

In [95]:
from rosalind_modules.generaltools import checker
from rosalind_modules.generaltools import func_to_file

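def gc_content(seq):
    """Return the GC-content of seq as a percentage (0.0 for an empty string)."""
    if not seq:
        return 0.0  # avoid dividing by zero on an empty sequence
    # Upper-case first so lower-case 'g' and 'c' are counted as well.
    seq = seq.upper()
    return 100.0 * (seq.count('G') + seq.count('C')) / len(seq)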

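def FASTA_parser(fasta_infile):
    """Read a FASTA file into a dict mapping each label to its sequence."""
    dct = {}
    current_id = None
    with open(fasta_infile, 'r') as infile:
        for line in infile:
            line = line.strip()  # also drops '\r' from Windows line endings
            if line.startswith('>'):
                current_id = line[1:]
                dct[current_id] = ''
            elif line and current_id is not None:
                # Sequence lines may wrap, so append to the current record.
                dct[current_id] += line
    return dct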

@func_to_file('GC_out.txt')
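def highest_gc(seqs_dct):
    """Return the ID with the highest GC-content and that GC-content, one per line."""
    max_gc = -1.0  # below any real GC-content, so an all-AT input still yields an ID
    max_gc_key = None
    for seq_id, seq in seqs_dct.items():
        gc = gc_content(seq)
        if gc > max_gc:
            max_gc = gc
            max_gc_key = seq_id
    if max_gc_key is None:
        raise ValueError('no sequences found')
    return max_gc_key + '\n' + str(max_gc)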

In [97]:
fasta = FASTA_parser('sample_in.txt')
highest_gc(fasta)
checker('GC_out.txt','sample_out.txt')

False

In [98]:
fasta = FASTA_parser('rosalind_gc.txt')
highest_gc(fasta)