<a href="https://colab.research.google.com/github/nitrozyna/Rosalind/blob/master/6_gc.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Problem description:
[Computing GC Content](http://rosalind.info/problems/gc/)

**Identifying Unknown DNA Quickly**
The GC-content of a **DNA string** is given by the percentage of **symbols** in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the **reverse complement** of any DNA string has the same GC-content.
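
The definition above can be checked directly on the example string. The following is a minimal sketch, not the notebook's own solution; the helper names `gc_percent` and `reverse_complement` are introduced here for illustration only, and the input is assumed to contain only the uppercase symbols A, C, G and T.

```python
def gc_percent(dna: str) -> float:
    """Return the percentage of symbols in `dna` that are 'G' or 'C'."""
    if not dna:
        # Assumption: an empty string is reported as 0% rather than raising.
        return 0.0
    return 100.0 * sum(base in "GC" for base in dna) / len(dna)


def reverse_complement(dna: str) -> str:
    """Complement every base (A<->T, C<->G), then reverse the string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


example = "AGCTATAG"
print(gc_percent(example))                      # 37.5
print(reverse_complement(example))              # CTATAGCT
print(gc_percent(reverse_complement(example)))  # 37.5
```

Complementing swaps every G with a C and vice versa, and reversing only changes the order, so the number of G/C symbols stays the same.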

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called **FASTA format**. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.
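
As an illustration of the format described above, here is a small sketch of a FASTA reader that collects each label and joins the sequence lines that follow it. The function name `parse_fasta` and the two records in `sample` are made up for this example; they are not part of the Rosalind dataset.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA label (without the leading '>') to its full sequence."""
    records: Dict[str, str] = {}
    label: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A '>' line starts a new record; the rest of the line is the label.
            label = line[1:]
            records[label] = ""
        elif label is not None:
            # Any other line continues the sequence of the current record.
            records[label] += line
    return records


# Hypothetical two-record input, used only to demonstrate the parser.
sample = """>Rosalind_0001
ACGT
ACGT
>Rosalind_0002
GGCC
"""
records = parse_fasta(sample.splitlines())
print(records)  # {'Rosalind_0001': 'ACGTACGT', 'Rosalind_0002': 'GGCC'}
```

Because the function accepts any iterable of lines, an open file object could be passed in place of `sample.splitlines()`.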

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.

---

### Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

### Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on **absolute error** below.
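
To make the error allowance concrete, the sketch below compares a submitted value against an expected one using the default absolute error of 0.001. The helper `within_tolerance` is a hypothetical name used for illustration, and how Rosalind's checker is actually implemented is not known here.

```python
def within_tolerance(answer: float, expected: float, tol: float = 0.001) -> bool:
    """Return True if `answer` differs from `expected` by at most `tol`."""
    return abs(answer - expected) <= tol


expected = 60.919540  # GC-content from the sample output

print(within_tolerance(60.92, expected))  # True: off by about 0.00046
print(within_tolerance(60.9, expected))   # False: off by about 0.01954
```

In practice this means printing the percentage with three or more decimal places is enough, while rounding to one decimal place may not be.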

Sample Dataset

>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT

Sample Output

Rosalind_0808
60.919540

In [0]:
#@title Importing some modules to make a connection between Colab and Drive to download the current dataset
!pip install PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)


In [0]:
#@title Loading test dataset
fileID = "1PCvTABDLa4mla9GH4Sk2JNSZKrcgdkCT" #@param {type:"string"}
downloaded = drive.CreateFile({'id':fileID})
downloaded.GetContentFile('rosalind_gc.txt')  # replace the file name with your file

In [0]:
#@title Function calculating the GC content
def GCContent(read):
    gc_content = 0
    all_bases = len(read)
    for base in read:
        if base == "G" or base == "C":
            gc_content += 1
    return gc_content/all_bases * 100


In [34]:
#@title Finding the read with highest GC content
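gc_content = {}
with open('rosalind_gc.txt', 'r') as f:
    for line in f:
        if line.startswith(">"):
            # Header line: drop the leading '>' and the trailing newline to get the ID
            read_name = line[1:].strip()
            gc_content[read_name] = ""
        else:
            # Sequence line: append it to the current record
            gc_content[read_name] += line.strip()

# Compute the GC content (in percent) of every record
new_dict = {}
for k, v in gc_content.items():
    new_dict[k] = GCContent(v)

# Report the ID with the highest GC content, followed by that GC content
max_content = max(new_dict, key=new_dict.get)
print("%s\n%f" % (max_content, new_dict[max_content]))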

Rosalind_6693
51.177730
