Computing GC Content solved


#Problem
The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
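The two claims above (37.5% for "AGCTATAG", and the reverse complement having the same GC-content) can be checked with a minimal sketch. The helper names `gc_percent` and `reverse_complement` are made up for this illustration and are not part of the solution cells below.

```python
def gc_percent(dna):
    """Percentage of symbols in dna that are 'G' or 'C'."""
    return 100 * (dna.count('G') + dna.count('C')) / len(dna)


def reverse_complement(dna):
    """Reverse the string and swap A<->T and C<->G."""
    return dna[::-1].translate(str.maketrans('ACGT', 'TGCA'))


s = 'AGCTATAG'
rc = reverse_complement(s)
print(gc_percent(s))   # 37.5 (3 of 8 symbols are G or C)
print(rc)              # CTATAGCT
print(gc_percent(rc))  # 37.5
```

Complementing turns every G into a C and every C into a G, and reversing only reorders symbols, so the combined G+C count cannot change.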

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.
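As a rough illustration of the format described above, the sketch below reads FASTA records from an in-memory string rather than a file. The function name `read_fasta` and the two short records are invented for this example; real Rosalind input has longer sequences, often wrapped across several lines.

```python
import io


def read_fasta(handle):
    """Yield (label, sequence) pairs from an iterable of FASTA lines."""
    label, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # A new header closes the previous record, if there was one
            if label is not None:
                yield label, ''.join(chunks)
            label, chunks = line[1:], []
        else:
            chunks.append(line)
    # The last record has no following header, so emit it here
    if label is not None:
        yield label, ''.join(chunks)


# Hypothetical sample data, not taken from a Rosalind dataset
sample = io.StringIO(">Rosalind_0001\nACGT\nGGCC\n>Rosalind_0002\nTTAA\n")
records = list(read_fasta(sample))
print(records)  # [('Rosalind_0001', 'ACGTGGCC'), ('Rosalind_0002', 'TTAA')]
```

Collecting the sequence lines in a list and joining once avoids repeated string concatenation, which matters little at 1 kbp but scales better for longer inputs.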

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.
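One way to read the error allowance, assuming it means the absolute difference between the submitted and expected values, is sketched below. The function name `within_tolerance` and the test values are illustrative only.

```python
def within_tolerance(answer, expected, tol=0.001):
    """True if answer differs from expected by at most tol (absolute error)."""
    return abs(answer - expected) <= tol


print(within_tolerance(37.5004, 37.5))  # True: off by about 0.0004
print(within_tolerance(37.502, 37.5))   # False: off by about 0.002
```

Under this reading, printing the GC-content with six decimal places is more than precise enough.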

In [29]:
# Read the input file (FASTA format) and return a list of (label, sequence) tuples
def parse_fasta(fasta_file):
    strings = []
    current_label = ''
    current_string = ''
    with open(fasta_file, 'r') as f:
        for line in f:
            if line.startswith('>'):
                if  current_label and current_string:
                    strings.append((current_label, current_string))
                current_label = line[1:].rstrip()
                current_string = ''
            else:
                current_string += line.rstrip()
        if current_label and current_string:  # handle the last sequence
            strings.append((current_label, current_string))
    return strings



In [30]:
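# Compute the GC content of the input sequence and return it as a percentage
def gc_content(string):
    if not string:  # avoid division by zero on an empty sequence
        return 0.0
    gc_counts = string.count('G') + string.count('C')
    return gc_counts / len(string) * 100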

In [31]:
# Find the sequence with the highest GC content; return its label, sequence, and GC content
def max_gc_content(strings):
    max_gc_label = ''
    max_gc_string = ''
    max_gc_data = -1.0  # below any real percentage, so the first sequence always wins
    for label, string in strings:
        gc = gc_content(string)
        if gc > max_gc_data:
            max_gc_label = label
            max_gc_string = string
            max_gc_data = gc

    return max_gc_label, max_gc_string, max_gc_data
    


In [32]:
# Read input file
input_file = r'D:\Study\Shanghaitech1\Rotation4\Bioinfo\Bioinfo\python\Python_Rookie_exercises\input_test4_202306120.txt'
strings = parse_fasta(input_file)
print(strings)
print(len(strings))
print(strings[0][1])

[('Rosalind_6404', 'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG'), ('Rosalind_5959', 'CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC'), ('Rosalind_0808', 'CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT'), ('Rosalind_1234', 'CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGATATCCATTTGTCAGCAGACACGC')]
4
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG


In [33]:

# Find string with the highest GC-content
max_gc_label, max_gc_string, max_gc_data = max_gc_content(strings)
max_gc_data = '{:.6f}%'.format(max_gc_data)  # six decimals is well within the 0.001 tolerance
print(max_gc_label, max_gc_string)  # print() separates its arguments with a space
print(max_gc_label + ':' + max_gc_string)  # joined with ':' by string concatenation
# Print the label, sequence, and GC content of the sequence with the highest GC content
print("Label of the highest-GC sequence: {}, its sequence: {}, its GC content: {}".format(max_gc_label, max_gc_string, max_gc_data))


Rosalind_0808 CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT
Rosalind_0808:CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT
Label of the highest-GC sequence: Rosalind_0808, its sequence: CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT, its GC content: 60.919540%
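As an alternative to the explicit loop, the search for the highest GC-content could be written with the built-in `max` and a `key` function. This is only a sketch on made-up records; `gc_percent`, `records`, and the `seq_*` labels are hypothetical names that do not appear in the cells above.

```python
def gc_percent(seq):
    """GC content of seq as a percentage; 0.0 for an empty sequence."""
    if not seq:
        return 0.0
    return 100 * (seq.count('G') + seq.count('C')) / len(seq)


# Hypothetical (label, sequence) pairs in the same shape parse_fasta returns
records = [('seq_a', 'ATAT'), ('seq_b', 'GGCA'), ('seq_c', 'GCAT')]

# max() compares records by the GC content of their sequence (element 1)
best_label, best_seq = max(records, key=lambda rec: gc_percent(rec[1]))
print(best_label, '{:.6f}'.format(gc_percent(best_seq)))  # seq_b 75.000000
```

Note that `max` raises `ValueError` on an empty list, so an input file with no records would need to be handled separately.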
