# L03 notebook

In this session, we will apply core Python concepts to biological data analysis.
The task is to write a Python program that manages a dataset of DNA sequences, sequence lengths, and organism names.
Along the way, we will cover fundamental Python elements, including variables, lists, dictionaries, conditional logic, loops, and functions.
The goal is to give students a solid understanding of how to use programming to analyze biological datasets systematically.

## Variables

Your first task is to declare variables that will hold the essential biological data.
Specifically, we need variables to represent DNA sequences, sequence lengths, and organism names.
These variables form the foundation of the rest of the program.

### DNA sequences

Declare a variable called `dna_sequences` that stores three DNA sequences.
These sequences are strings composed of nucleotide bases (`A`, `T`, `C`, `G`) and will be the focal point of our biological data.
Feel free to make these different lengths!

<div class="admonition tip">
    <p class="admonition-title">Hint</p>
    <p style="padding-top: 1em">
        <code>dna_sequences</code> should be a collection.
    </p>
</div>
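
One possible starting point is sketched below, assuming a list is used as the collection. The three sequences are made-up placeholder values, not real data.

```python
# Three made-up sequences of different lengths (placeholder values only).
dna_sequences = ["ATGCGTAC", "GGCTA", "TTAACCGGAT"]

print(dna_sequences)
```

A list keeps the sequences in a fixed order, which makes it easy to match each one with a length and an organism name later.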

### Sequence Lengths

Create another variable called `sequence_lengths` to store the corresponding lengths of the DNA sequences.
The lengths will be integers representing the number of bases in each sequence.

<div class="admonition tip">
    <p class="admonition-title">Hint</p>
    <p style="padding-top: 1em">
        You should <b>loop</b> over <code>dna_sequences</code> and add the length to <code>sequence_lengths</code>.
    </p>
</div>
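
The loop described in the hint might look like the following sketch. The sequences are the same made-up placeholders, repeated here so the cell runs on its own.

```python
# Placeholder sequences (illustrative values only).
dna_sequences = ["ATGCGTAC", "GGCTA", "TTAACCGGAT"]

# Start with an empty list and append one length per sequence.
sequence_lengths = []
for sequence in dna_sequences:
    sequence_lengths.append(len(sequence))

print(sequence_lengths)  # [8, 5, 10]
```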

### Organism names

Lastly, declare a variable called `organism_names` to store the names of the organisms associated with each DNA sequence.
These names will provide context to the biological data.

<div class="admonition tip">
    <p class="admonition-title">Hint</p>
    <p style="padding-top: 1em">
        You should manually name them anything you would like.
    </p>
</div>

## Collecting the data

Now that we have our variables representing biological data, the next step involves organizing and managing this information systematically.
To achieve this, we will use a dictionary&mdash;a powerful data structure in Python.

### Create a DNA Data Dictionary

Introduce a dictionary named `dna_data` to comprehensively store information about each DNA sequence.
This dictionary will serve as a centralized repository, associating each sequence with its length and the respective organism.

Start with an empty dictionary, and then iterate over `organism_names`, `dna_sequences`, and `sequence_lengths` using only one `for` loop.
Each dictionary key should be exactly one name, and each value should be another dictionary with keys `sequence` and `length` with their respective data.

<div class="admonition tip">
    <p class="admonition-title">Hint</p>
    <p style="padding-top: 1em">
        You can also loop over the index of each element in the list.
    </p>
</div>
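
A minimal sketch of the single-loop approach is shown below. It uses the built-in `zip` to walk the three lists together; looping over `range(len(organism_names))` and indexing each list, as the hint suggests, would work equally well. The organism names and sequences are hypothetical placeholders.

```python
# Placeholder data (illustrative values only).
organism_names = ["organism_a", "organism_b", "organism_c"]
dna_sequences = ["ATGCGTAC", "GGCTA", "TTAACCGGAT"]
sequence_lengths = [8, 5, 10]

# Build the nested dictionary with a single for loop.
dna_data = {}
for name, sequence, length in zip(organism_names, dna_sequences, sequence_lengths):
    dna_data[name] = {"sequence": sequence, "length": length}

print(dna_data)
```

Note that `zip` stops at the shortest list, so the three lists should have the same number of elements.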

### Get DNA information

Retrieve the sequence from one of your organisms using one of your keys and print it.
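
Because each value in `dna_data` is itself a dictionary, retrieval takes two lookups: first by organism name, then by `"sequence"`. A short sketch, assuming a hypothetical key `"organism_a"`:

```python
# Placeholder dictionary (illustrative values only).
dna_data = {
    "organism_a": {"sequence": "ATGCGTAC", "length": 8},
    "organism_b": {"sequence": "GGCTA", "length": 5},
}

# First lookup selects the organism, second lookup selects the field.
retrieved_sequence = dna_data["organism_a"]["sequence"]
print(retrieved_sequence)  # ATGCGTAC
```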

### Make a correction

Whoops, you forgot to add `ATTGTGC` to the end of one of your sequences.
Fix the sequence and update the length.

<div class="admonition tip">
    <p class="admonition-title">Hint</p>
    <p style="padding-top: 1em">
        Add the string <code>"ATTGTGC"</code> to the value stored under one of your <code>sequence</code> keys and then recalculate the length.
    </p>
</div>
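
One way to apply the correction is sketched below, assuming the affected entry is the hypothetical `"organism_a"`. Recomputing the length with `len` avoids having to count bases by hand.

```python
# Placeholder dictionary (illustrative values only).
dna_data = {
    "organism_a": {"sequence": "ATGCGTAC", "length": 8},
}

# Append the missing bases, then recompute the length from the new sequence.
entry = dna_data["organism_a"]
entry["sequence"] = entry["sequence"] + "ATTGTGC"
entry["length"] = len(entry["sequence"])

print(dna_data["organism_a"])
```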

## Short sequences only

To make decisions based on our data, we need conditional statements.
In this section, we will use `if`/`else` logic to check whether a DNA sequence exceeds a defined length threshold.

### Define the length threshold

Start by defining a length threshold that serves as the criterion for our conditional check.
This threshold will determine whether a DNA sequence is considered "long" or "short."

<div class="admonition tip">
    <p class="admonition-title">Hint</p>
    <p style="padding-top: 1em">
        All you need is to define an <code>int</code>.
    </p>
</div>

### Implement Conditional Logic

Utilize `if`/`else` statements to check each DNA sequence against the defined length threshold.
In this example, we'll print whether each organism's DNA sequence is considered `"long"` or `"short"`.

<div class="admonition tip">
    <p class="admonition-title">Hint</p>
    <p style="padding-top: 1em">
        You should <b>loop</b> over <code>dna_data</code> and print <code>"long"</code> or <code>"short"</code> for each organism.
    </p>
</div>
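
The threshold and the check can be combined as in the sketch below. The threshold of 6 and the rule that "long" means strictly greater than the threshold are assumptions made for illustration; the data are placeholders.

```python
# Placeholder dictionary (illustrative values only).
dna_data = {
    "organism_a": {"sequence": "ATGCGTAC", "length": 8},
    "organism_b": {"sequence": "GGCTA", "length": 5},
    "organism_c": {"sequence": "TTAACCGGAT", "length": 10},
}

# Assumed threshold: sequences with more bases than this count as "long".
length_threshold = 6

# Keep the labels in a dictionary as well, so they can be inspected afterwards.
labels = {}
for name, info in dna_data.items():
    if info["length"] > length_threshold:
        labels[name] = "long"
    else:
        labels[name] = "short"
    print(name, "is", labels[name])
```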

## Looping

Utilize a `for` loop to iterate through each organism's DNA data.
Print relevant information such as organism name, DNA sequence, and length for each iteration.
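
A possible version of this loop is sketched below using `dict.items()`, which yields each key together with its value. The data and the output format are illustrative assumptions.

```python
# Placeholder dictionary (illustrative values only).
dna_data = {
    "organism_a": {"sequence": "ATGCGTAC", "length": 8},
    "organism_b": {"sequence": "GGCTA", "length": 5},
}

# Format one line per organism, printing it and keeping a copy in a list.
report = []
for name, info in dna_data.items():
    line = f"{name}: sequence={info['sequence']}, length={info['length']}"
    report.append(line)
    print(line)
```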

## Let's get more realistic

I am going to create a long random nucleotide sequence.
Do not worry too much about understanding the cell directly below, but do not change it.

In [1]:
import random

random.seed(8793624980237)

bases = ["A", "T", "C", "G"]


def generate_realistic_sequence(length):
    # Build a string of `length` bases, each chosen at random from `bases`.
    return "".join(random.choice(bases) for _ in range(length))


realistic_sequence = generate_realistic_sequence(1000)
print(realistic_sequence)

AGGATACAACCAGGGTAGGCGGGAGTAGTATTCGAAGATTTGAGTTATGTCAAACCAAATGAGATACGGAATTTTTAGTCTTCTATAGTTAACTACTGCAGATCGCCCTCATTTCGTCGTACTACCTCGAACTATCGCGGATTGTAAGATTACGATTTTCGGGGGATTAATCCCCGGCAGTGGATTAAGGTCAAAAACAAGAAACCGACACACATATTGCGATATCCTGAGCGCTGCGCTTGAATCATCCCTCTTAGGACGAGAATCAGGAACAATTGGGATCTTGCCGATGGGAATGGCTGTAACGGCGCCACCGGGAATGAACTTAGAACCTCAAAGTGAGCTCTCATCCGGTAACTACGGCTGGGCGCTAGTGGTTCGATTGGACACGATGGCATAGTCAGTTGGAGAAAATCGAACTGAACCTAATCTGGTAAAGTAAGGTCCCACTTCATTGGTTCACCCCCGCGTTGGGTTTCACCAAGTTGTTTCCGATTAGGAGGTACATTATCATTTCCATCTTAGTGTGTTCAGCAGCACGTCTAAATAGGGCGTATACTGCCACGATCTCAACAGCAGTCGCTTAATCCGTGCTCGCCTGACGGTAGACTTTGGGACGACAGGGCATGACATTACAGGGGAGGATTCAAGTGTTCACTGAACAGATGAAGTGGATTTTGTGGAGTGGGCAGGCTAACATCTCAAATCTAGAGAGGCATGCACCGTGCTTCTGATTGACGAAGAGCAATGCTCCTGCCGTTGAATCCCCCCCACAATCGACGCGTCACGCACGGTCTGCTTTGCAAGTCGAGTATCTAGCACTACAGCACACGCCCCTAGCGTCGCACGTGAGTCTACCCTTACACAGTTGATGAACCGTCCGTACAGCGGTCGCGCAAGCTCATCGCAAAAGGGTTAGATTCCCGTAGGCGATGCACGGCCACTAGCTCACAAAGTAGTTTCTAGCCACGGAACAATCATCCGATTATATTTTCACAAG

### Nucleotide counts

Create a function called `nucleotide_counts` to count the occurrences of each nucleotide (A, T, C, G) in a DNA sequence.

This should take in a sequence and return a dictionary where the keys are the nucleotides and the values are the counts.
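
A minimal sketch of such a function follows. Skipping any character other than `A`, `T`, `C`, or `G` is an assumption made here; the exercise does not say how to handle unexpected characters.

```python
def nucleotide_counts(sequence):
    """Count how many times each of A, T, C, and G appears in a sequence."""
    counts = {"A": 0, "T": 0, "C": 0, "G": 0}
    for base in sequence:
        # Assumption: characters other than the four bases are ignored.
        if base in counts:
            counts[base] += 1
    return counts


print(nucleotide_counts("ATGCGTAC"))  # {'A': 2, 'T': 2, 'C': 2, 'G': 2}
```

The same call works on the longer sequence from the cell above, for example `nucleotide_counts(realistic_sequence)`.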

### GC Content Calculation

Define a function called `gc_content` to calculate the GC content of a DNA sequence.
GC content represents the percentage of bases in a DNA sequence that are either Guanine (G) or Cytosine (C).

This should take in a sequence and return the GC content as a float.
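
One way to write this is sketched below, returning a percentage between 0 and 100 to match the definition above. Returning `0.0` for an empty sequence is an assumption made to avoid dividing by zero.

```python
def gc_content(sequence):
    """Return the percentage of bases in a sequence that are G or C."""
    if len(sequence) == 0:
        # Assumption: an empty sequence has a GC content of 0.0.
        return 0.0
    gc_count = 0
    for base in sequence:
        if base == "G" or base == "C":
            gc_count += 1
    return 100 * gc_count / len(sequence)


print(gc_content("ATGCGTAC"))  # 50.0
```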

### Reverse Complement

Create a function called `reverse_complement` to find the reverse complement of a DNA sequence.
The reverse complement is formed by reversing the sequence and replacing each base with its complementary base (A pairs with T, and C pairs with G).

This should take in a sequence and then return a string of the reverse complement.
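
A possible implementation is sketched below, using a dictionary to look up each complement. It assumes the input contains only uppercase `A`, `T`, `C`, and `G`; any other character raises a `KeyError`.

```python
def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    result = ""
    # Walk the sequence from the last base to the first, swapping each base.
    for base in reversed(sequence):
        result += complement[base]
    return result


print(reverse_complement("AAGC"))  # GCTT
```

Applying the function twice should return the original sequence, which makes a handy sanity check.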