### Computing GC Content 

#### Identifying Unknown DNA Quickly
Software can identify the language of a text by analyzing how often each letter appears, because every language has its own characteristic letter frequencies. The same can be said for genomes: the frequencies of the bases differ from one species to another.

Although two members of the same species have different genomes, the differences are small: any two humans share about 99.9% of the 3.2 billion base pairs in the human genome (excluding individuals with major genetic defects). A representative "average-case" genome of this kind can be assembled for any species.

In a double-stranded DNA molecule, cytosine and guanine always appear in equal amounts, because every G on one strand pairs with a C on the other. GC content (the percentage of bases that are either cytosine or guanine) can be used to differentiate many prokaryotes and eukaryotes from small DNA samples.


#### Problem 
The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
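
As a quick check of this definition, here is a minimal sketch in Python. The helper names `gc_content` and `reverse_complement` are illustrative and not part of the Rosalind problem, and the code assumes an uppercase, non-empty string over the alphabet A, C, G, T.

```python
def gc_content(dna: str) -> float:
    """Return the percentage of symbols in `dna` that are 'G' or 'C'."""
    if not dna:
        raise ValueError("GC-content is undefined for an empty string")
    return (dna.count("G") + dna.count("C")) / len(dna) * 100


def reverse_complement(dna: str) -> str:
    """Complement each base (A<->T, C<->G), then reverse the string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


example = "AGCTATAG"
print(gc_content(example))                      # 37.5
print(gc_content(reverse_complement(example)))  # 37.5
```

Complementing swaps every G with a C and vice versa, so the combined count of the two is unchanged, and reversing does not affect counts at all. That is why both calls print the same value.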

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the next line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.
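
The record layout described above can be read line by line. The following is a rough sketch rather than a full FASTA parser: the function name `parse_fasta` and the two short records are made up for illustration, and it assumes each label occupies a single line and is unique within the input.

```python
def parse_fasta(text: str) -> dict:
    """Map each FASTA label to its sequence, joining wrapped lines."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            # a '>' line starts a new record; drop the marker itself
            label = line[1:]
            records[label] = ""
        elif label is not None:
            # any other line continues the current record's sequence
            records[label] += line
    return records


# Hypothetical two-record input, invented for this example
sample = """>Rosalind_0001
ACGT
AC
>Rosalind_0002
GGCC
"""
print(parse_fasta(sample))
# {'Rosalind_0001': 'ACGTAC', 'Rosalind_0002': 'GGCC'}
```

Splitting on lines, rather than slicing a fixed number of characters, keeps the parsing independent of the label length.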

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below. 
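
To make the tolerance concrete, here is a small sketch that treats the allowed error as an absolute difference between the submitted and the expected value. The function name `within_tolerance` and the numbers passed to it are illustrative assumptions, not part of Rosalind's checker.

```python
def within_tolerance(answer: float, expected: float, tol: float = 0.001) -> bool:
    """Return True if `answer` differs from `expected` by at most `tol`."""
    return abs(answer - expected) <= tol


# Illustrative values: rounding to two decimals stays inside the tolerance,
# rounding to one decimal does not.
print(within_tolerance(60.92, 60.919540))  # True
print(within_tolerance(60.9, 60.919540))   # False
```

In practice this means printing the GC-content with three or more decimal places is enough; printing six, as the sample output does, leaves a comfortable margin.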

Sample Dataset:
>\>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG

>\>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC 

>\>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT

Sample Output: 
>Rosalind_0808

>60.919540


In [1]:
# initialize lists

ros_id = []
gc_content = []

# open file & split into FASTA records (one per '>')
with open("datasets/rosalind_gc.txt") as f:
    fasta = f.read().split(">")

for val in fasta:
    if val.strip():
        # first line is the label (Rosalind ID); the rest is the sequence
        label, *seq_lines = val.splitlines()
        ros_id.append(label.strip())

        # join the wrapped sequence lines into a single string
        string = "".join(line.strip() for line in seq_lines)
        
        # calculate gc content 
        num_gc = string.count("G") + string.count("C")
        num_bases = len(string)

        gc = num_gc/num_bases * 100 
        gc_content.append(gc)

# find index of FASTA string with highest GC content
max_id = gc_content.index(max(gc_content))

# print results 
print(ros_id[max_id])
print(round(gc_content[max_id],6))
    

Rosalind_7004
51.672862


In [3]:
# using pandas 
import pandas as pd 

with open("datasets/rosalind_gc.txt") as f:
    fasta_list = f.read().replace("\n","").lstrip(">").split(">")

# build df with one row per FASTA record
df_fasta = pd.DataFrame(fasta_list, columns=['raw'])
df_fasta['ros_id'] = df_fasta['raw'].str[:13]
df_fasta['string'] = df_fasta['raw'].str[13:]

# calculate gc content
count_gc = df_fasta['string'].str.count("G") + df_fasta['string'].str.count("C")
count_bases = df_fasta['string'].str.len()
df_fasta['gc_content'] = (count_gc / count_bases) * 100

# find index of max gc_content 
maxgc_id = df_fasta['gc_content'].idxmax()

# print Rosalind ID and max gc content
print(df_fasta.loc[maxgc_id, 'ros_id'])
print(f"{df_fasta.loc[maxgc_id, 'gc_content']:.6f}")

Rosalind_7004
51.672862
