# The GC calculator

I downloaded the genome of *Utricularia gibba* (humped bladderwort) from [NCBI](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCA_002189035.1/) and decompressed it. The genome has one file per chromosome (e.g. "chr1.fna"). You can download the data for yourself to take a look!

There are two goals for this code:
* Get the label of the sequence with the largest percentage of GC content
* Calculate the GC percentage of the whole genome

However, there are a couple of issues. Can you help me fix the code?

Use the controls above to run blocks of code. Let's take a look at each section:

First, the imports. Only the `Path` class from `pathlib` is needed, to make it easier to find the FASTA files.

In [None]:
from pathlib import Path

## `parse_file`

Currently, using this function takes more than 4 hours. How can we speed this up? See the "[Tips and Tricks](https://brightspace.wur.nl/d2l/le/content/190722/viewContent/796144/View)" module in Brightspace.

Additionally, make sure that the data is [correctly read](https://brightspace.wur.nl/d2l/le/content/190722/viewContent/814684/View) into the data structure, otherwise it can cause issues later on when you need to process it!

In [None]:
def parse_file(filename, data):
    """Parses a fasta file with a single record

    Parameters:
    ---

    filename: Path
        The path to a fasta file

    data: dict
        A dictionary where label, sequence will be stored

    Returns:
    ---
    
    data: dict
        The updated dictionary
    """

    label = ""

    with open(filename, "r") as f:
        all_lines = f.readlines()

    # First line is the label. Also, remove '>' symbol
    label = all_lines[0][1:]

    # label shouldn't be repeated
    data[label] = ""
    for line in all_lines[1:]:
        data[label] += line

    return data
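
One possible direction for the two hints above, offered as a sketch rather than the intended solution: building a long string with repeated `+=` can copy the growing sequence on every line, while collecting the lines in a list and joining them once avoids that. Stripping each line also keeps newline characters out of the label and the sequence. The helper name `read_single_record` and the demo file are made up for this illustration.

```python
import tempfile
from pathlib import Path


def read_single_record(filename):
    """Return (label, sequence) for a FASTA file holding one record.

    Hypothetical helper for illustration; not part of the exercise code.
    """
    with open(filename, "r") as handle:
        # Header line: drop the trailing newline, then the leading '>'
        label = handle.readline().strip()[1:]
        # Collect the stripped lines and join them once at the end
        chunks = [line.strip() for line in handle]
    return label, "".join(chunks)


# Small demonstration on a made-up record
with tempfile.TemporaryDirectory() as tmp:
    fasta = Path(tmp) / "demo.fna"
    fasta.write_text(">demo record\nACGT\nGGCC\n")
    demo_label, demo_sequence = read_single_record(fasta)

print(demo_label)     # demo record
print(demo_sequence)  # ACGTGGCC
```

Whether this matches the approach in the "Tips and Tricks" module is an assumption; it is one common way to avoid slow string building in Python.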

## `gc_calculator`

Next, calculation of the GC content. The numbers should match the ones on the [NCBI genome page](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCA_002189035.1/). What is wrong?

The label of the sequence with the highest GC content is also incorrect. Can you spot the error?

In [None]:
def gc_calculator(data):
    """Calculates the total GC content

    Parameters:
    ---

    data: dict
        A dictionary with key=records' labels, values=sequences

    Returns:
    ---

    record_max_gc: str
        The label of the sequence with the highest gc content
    total_gc: float
        The percentage of gc content for all data
    """

    # counters for total gc, atgc
    gc = 0
    atgc = 0

    # counters for gc, atgc per record
    record_gc = 0
    record_atgc = 0

    label_max_gc = "" # label of sequence with highest gc percentage
    max_gc_record = 0 # keeps the highest gc percentage per record

    # get both key and value
    for label, sequence in data.items():
        atgc += len(sequence)
        record_gc = 0
        
        for char in sequence:
            if char.lower() in {"g", "c"}:
                gc += 1
                record_gc += 1
            else:
                atgc += 1

        # Substitute label_max_gc if better gc found
        if record_gc > max_gc_record:
            max_gc_record = record_gc
            label_max_gc = label
        

    return label_max_gc, gc/atgc
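
As a point of comparison while debugging, here is a minimal sketch of what a GC percentage means for a single sequence, and why a raw count and a percentage can rank records differently. The function `gc_percentage` and the two toy records are hypothetical, and the sketch assumes the denominator is the full sequence length (every character counts, whatever it is).

```python
def gc_percentage(sequence):
    """Percentage of G and C characters in a sequence (case-insensitive).

    Hypothetical helper; assumes every character counts toward the length.
    """
    if not sequence:
        return 0.0
    upper = sequence.upper()
    gc_count = upper.count("G") + upper.count("C")
    return 100 * gc_count / len(sequence)


# Toy records: the longer one has MORE G/C characters but a LOWER percentage
records = {
    "short_rich": "GGCC",             # 4 of 4 are G/C
    "long_poor": "ATATATATGC" * 3,    # 6 of 30 are G/C
}

per_record = {label: gc_percentage(seq) for label, seq in records.items()}
best_label = max(per_record, key=per_record.get)

# Whole-data percentage: all G/C characters over all characters
total = gc_percentage("".join(records.values()))

print(per_record)   # {'short_rich': 100.0, 'long_poor': 20.0}
print(best_label)   # short_rich
```

Comparing the output of a small check like this with what `gc_calculator` returns for the same toy dictionary may help locate where the numbers diverge.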

## `main`

The `main` function divides the task into three major blocks: reading the data, processing it, and printing the requested information.

### Think:

A dictionary was chosen as the data structure to hold the data. What data structure would you choose? See some options [here](https://brightspace.wur.nl/d2l/le/content/190722/viewContent/816822/View).
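
To make the question concrete, here is a sketch of one alternative among many: a list of small named tuples, which keeps the label and the sequence together and preserves the reading order. The `Record` type and the toy sequences are assumptions for illustration, and this is not necessarily one of the options on the linked page.

```python
from typing import NamedTuple


class Record(NamedTuple):
    """Hypothetical container pairing a label with its sequence."""
    label: str
    sequence: str


# Toy data standing in for parsed FASTA records
records = [
    Record("chr1", "ATGC"),
    Record("chr2", "GGCCAT"),
]

# Fields are accessed by name instead of by dictionary key
longest = max(records, key=lambda record: len(record.sequence))
print(longest.label)  # chr2
```

Unlike a dictionary, a list does not silently overwrite a record when two files share the same label, but it also loses fast lookup by label, so the better choice depends on how the data is used later.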

In [None]:
def main():
    data = {}

    # obtain the paths to every file with .fna extension
    print("Parsing fasta files:")
    for fasta in sorted(Path("./").glob("*.fna")):
        print(fasta.stem)
        data = parse_file(fasta, data)

    # get the gc numbers
    print("\nCalculating gc percentage values...\n")
    record_max_gc, total_gc_percentage = gc_calculator(data)

    # Print results:
    print(f"Sequence with max GC percentage: {record_max_gc}")
    print(f"Genome GC content: {total_gc_percentage}")
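
The reading block of `main` relies on `Path.glob` and `sorted`. The sketch below shows what that combination returns on a temporary folder with made-up file names: only files matching the pattern are picked up, and the order is plain string order, not numeric chromosome order.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Made-up files: three FASTA files and one unrelated text file
    for name in ("chr1.fna", "chr2.fna", "chr10.fna", "notes.txt"):
        (folder / name).write_text(">placeholder\nACGT\n")

    # Same pattern as in main(), applied to the temporary folder
    stems = [path.stem for path in sorted(folder.glob("*.fna"))]

print(stems)  # ['chr1', 'chr10', 'chr2']
```

The order does not affect the GC results, but it explains why the file names printed by `main` may not appear in numeric order.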

### `__main__`

Finally, the entry point for the `main` function. See some videos about the importance of this section:
* [You should put this in all your Python scripts | if __name__ == '__main__':](https://www.youtube.com/watch?v=g_wlZ9IhbTs)
* [Python Tutorial: if __name__ == '__main__'](https://www.youtube.com/watch?v=sugvnHA7ElY)
* [What is Python's Main Function Useful For?](https://www.youtube.com/watch?v=lVUOrPunRxQ)

In [None]:
if __name__ == "__main__":
    main()