# L03 notebook

In this programming session, we apply core Python concepts to biological data analysis.
The task is to write a Python program that manages a dataset of DNA sequences, sequence lengths, and organism names.
Along the way, we cover fundamental Python elements: variables, lists, dictionaries, conditional logic, loops, and functions.
The goal is to give students a solid understanding of how programming supports the systematic analysis of biological datasets.

## Variables

Your first task is to declare variables that hold the biological data we will work with.
Specifically, we need variables for DNA sequences, sequence lengths, and organism names.
Everything later in the program builds on these variables.

### DNA sequences

Declare a variable called `dna_sequences` that stores three DNA sequences.
These sequences are strings composed of nucleotide bases (`A`, `T`, `C`, `G`) and will be the focal point of our biological data.
Feel free to make these different lengths!
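
One possible way to do this is with a list of strings. The three sequences below are made up for illustration and are not real genomic data; substitute your own.

```python
# Three hypothetical DNA sequences of different lengths, stored in a list.
dna_sequences = ["ATGCGTAC", "GGCTA", "TTAGGCATCGAA"]

print(dna_sequences)
```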

### Sequence Lengths

Create another variable called `sequence_lengths` to store the corresponding lengths of the DNA sequences.
The lengths will be integers representing the number of bases in each sequence.
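
A minimal sketch, assuming the same made-up sequences as placeholder data: the built-in `len()` returns the number of characters in a string, so the lengths do not need to be counted by hand.

```python
# Hypothetical example sequences; substitute your own.
dna_sequences = ["ATGCGTAC", "GGCTA", "TTAGGCATCGAA"]

# Use len() on each sequence so the lengths always match the data.
sequence_lengths = [
    len(dna_sequences[0]),
    len(dna_sequences[1]),
    len(dna_sequences[2]),
]

print(sequence_lengths)  # [8, 5, 12]
```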

### Organism names

Lastly, declare a variable called `organism_names` to store the names of the organisms associated with each DNA sequence.
These names will provide context to the biological data.
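
For example, the names could be kept in a list whose order matches `dna_sequences`. The organisms below are arbitrary choices for illustration; the made-up sequences used in this notebook's examples do not actually come from them.

```python
# Hypothetical organism names, listed in the same order as the sequences.
organism_names = ["Escherichia coli", "Homo sapiens", "Mus musculus"]

print(organism_names)
```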

Ensure that these variables accurately reflect the biological data at hand, as they will form the basis for subsequent stages in our program.
Once you have successfully declared these variables, we can proceed to explore further Python concepts in the context of biological data analysis.

## Collecting the data

Now that we have our variables representing biological data, the next step involves organizing and managing this information systematically.
To achieve this, we will use a dictionary, a powerful data structure in Python.

### Create a DNA Data Dictionary

Create a dictionary named `dna_data` that stores the information about each DNA sequence.
This dictionary will serve as a central repository, associating each sequence with its length and its organism.
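
There are several reasonable layouts for such a dictionary. The sketch below assumes one of them: the organism name is the key, and each value is a smaller dictionary with `"sequence"` and `"length"` entries. The data and the key names are illustrative assumptions.

```python
# Hypothetical example data; substitute your own variables.
dna_sequences = ["ATGCGTAC", "GGCTA", "TTAGGCATCGAA"]
sequence_lengths = [8, 5, 12]
organism_names = ["Escherichia coli", "Homo sapiens", "Mus musculus"]

# Key: organism name. Value: a dictionary holding the sequence and its length.
dna_data = {
    organism_names[0]: {"sequence": dna_sequences[0], "length": sequence_lengths[0]},
    organism_names[1]: {"sequence": dna_sequences[1], "length": sequence_lengths[1]},
    organism_names[2]: {"sequence": dna_sequences[2], "length": sequence_lengths[2]},
}

print(dna_data)
```

Keying by organism name makes lookups read naturally, at the cost of requiring each organism name to be unique.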

### Get DNA information

Retrieve the sequence from one of your organisms and print it.
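
Assuming a dictionary keyed by organism name with nested `"sequence"` and `"length"` entries (an illustrative layout, with made-up data), retrieval takes two lookups: first the organism, then the field.

```python
# Hypothetical dictionary layout and data.
dna_data = {
    "Escherichia coli": {"sequence": "ATGCGTAC", "length": 8},
    "Homo sapiens": {"sequence": "GGCTA", "length": 5},
    "Mus musculus": {"sequence": "TTAGGCATCGAA", "length": 12},
}

# First index by organism, then by the field we want.
ecoli_sequence = dna_data["Escherichia coli"]["sequence"]
print(ecoli_sequence)  # ATGCGTAC
```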

### Make a correction

Whoops, you forgot to add `ATTGTGC` to the end of one of your sequences.
Fix the sequence and update the length.
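
A sketch of one way to apply the fix, again assuming the illustrative nested layout and made-up data. Recomputing the length with `len()` after the change keeps the two fields consistent.

```python
# Hypothetical dictionary layout and data.
dna_data = {
    "Escherichia coli": {"sequence": "ATGCGTAC", "length": 8},
    "Homo sapiens": {"sequence": "GGCTA", "length": 5},
    "Mus musculus": {"sequence": "TTAGGCATCGAA", "length": 12},
}

# Append the missing bases to the end of one sequence.
entry = dna_data["Homo sapiens"]
entry["sequence"] = entry["sequence"] + "ATTGTGC"

# Recompute the length from the corrected sequence.
entry["length"] = len(entry["sequence"])

print(dna_data["Homo sapiens"])
```

Because `entry` refers to the same dictionary that is stored inside `dna_data`, changing `entry` also changes `dna_data`.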

## Short sequences only

To make decisions based on our data, we need conditional statements.
In this section, we will implement `if`/`else` conditional logic to check whether a DNA sequence exceeds a defined length threshold.

### Define the length threshold

Start by defining a length threshold that serves as the criterion for our conditional check.
This threshold will determine whether a DNA sequence is considered "long" or "short."

### Implement Conditional Logic

Utilize `if`/`else` statements to check each DNA sequence against the defined length threshold.
In this example, we'll print whether each organism's DNA sequence is considered `"long"` or `"short"`.
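
The sketch below assumes a threshold of 10 bases, an arbitrary value chosen so that the made-up data exercises both branches. The names `length_threshold`, `ecoli_label`, and `mouse_label` are illustrative.

```python
# Hypothetical dictionary layout and data.
dna_data = {
    "Escherichia coli": {"sequence": "ATGCGTAC", "length": 8},
    "Mus musculus": {"sequence": "TTAGGCATCGAA", "length": 12},
}

# Assumed threshold: sequences with more bases than this count as "long".
length_threshold = 10

if dna_data["Escherichia coli"]["length"] > length_threshold:
    ecoli_label = "long"
else:
    ecoli_label = "short"
print("Escherichia coli:", ecoli_label)  # Escherichia coli: short

if dna_data["Mus musculus"]["length"] > length_threshold:
    mouse_label = "long"
else:
    mouse_label = "short"
print("Mus musculus:", mouse_label)  # Mus musculus: long
```

With `>`, a sequence of exactly 10 bases counts as short; use `>=` if the threshold itself should count as long.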

## Looping

Utilize a `for` loop to iterate through each organism's DNA data.
Display relevant information such as organism name, DNA sequence, and length for each iteration.
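
One way to write this loop, assuming the illustrative nested dictionary layout and made-up data: `dict.items()` yields each key together with its value, so the organism name and its record are available in every iteration.

```python
# Hypothetical dictionary layout and data.
dna_data = {
    "Escherichia coli": {"sequence": "ATGCGTAC", "length": 8},
    "Homo sapiens": {"sequence": "GGCTA", "length": 5},
    "Mus musculus": {"sequence": "TTAGGCATCGAA", "length": 12},
}

# Each iteration gives one organism name and its nested dictionary.
for organism, info in dna_data.items():
    print(f"{organism}: {info['sequence']} ({info['length']} bases)")
```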

## Let's get more realistic

The code below generates a random DNA sequence of 10,000 bases to use in the remaining exercises.
Do not worry too much about understanding it.

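```python
import random

# The four nucleotide bases to sample from.
bases = ["A", "T", "C", "G"]


def generate_realistic_sequence(length):
    # Build a string of `length` bases, each picked at random from `bases`.
    # Reading a module-level variable does not require a `global` statement.
    return "".join(random.choice(bases) for _ in range(length))


realistic_sequence = generate_realistic_sequence(10000)
print(realistic_sequence)
```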

CACCGGAGTCCATATCTGCATCCAGGTGCACCCCCCGCGGGGGGCATAGCGTTCCGGTTGCCACTCGATCAGAGGAGCTCGACCTCCGAATAGACCTGTGTTGTGCTTGTGTGCAGAACAAACCCTCATGCTCATCCGTCCAGTAATTTGGTCCCAGGCGGTTTAATATTCAAGTGTGATGAAATCATATCCAGGGGTTAGATCCAGTGCGGTGCGAGGCCGCGTAATAACGCATGCACAGCCATCTCCCCGTGTGGGCGCAAGGGCTTGTGCCCCTGCGACTTTAGCCCGGTCGTCATTGCGGGGATTCTTGACGTGCGTAATCTGTTGCCACTCTAGGGATTCCGTGGTTTCTAAGGTGGCACCGACTGTTCCTGGAAATGGTCACTACACTACACCCTACTCTGATGGCATCTGGGTTGGGTTCTCGTTTAAAAGTGGGGACCGCGCCTTGCTCTGGGGGAAATATCCAACTGTTACGACCTCATAGACTCCACTTTCCCATGCAGGTACAAGCTCCCCCCTAGCCTTGTAATGACGATATGTATATCAGGCGGTGTTCGATAATGGCGCCCATGTCCTGAGGAGGAAGTTGTTGATTGCGAGCGAATCCTTCGGTCGGTCGCCCTGAGCACGGGCATTGCGCATATTCCCACAAACTGGGTGAGGACTAATTTATAGCATGCAGTTGCGACGCCCTGCGGTTTTAACGAGCTCTCGTATACCCAAGGCTGGACACCCATGTTTAAGGGTGATCGATCCAGCGAAGTCTAGCATCGTCAAATCGAACGTAAAGTTCCACGCCAGATTCTGCGTTTTGTAGCAATGGTATACTCGAGTGTCACTCATCTATATTATAGCGGTCCCGCGTGAAATCATGGATTTGGCGTGCCCATTTGACCAGCTGAAATATACGCGAGTTGCATGATTGTATCGAAAGTGCCAAGCGTAACCACCCGCAATGCGTGTTCTGGACTCGGTCTTGTAAAGTACACTTCAC

### Nucleotide counts

Create a function to count the occurrences of each nucleotide (A, T, C, G) in a DNA sequence.
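
A minimal sketch of such a function, assuming the hypothetical name `count_nucleotides` and that any character other than `A`, `T`, `C`, or `G` should simply be ignored.

```python
def count_nucleotides(sequence):
    """Return a dictionary with the number of A, T, C, and G bases."""
    counts = {"A": 0, "T": 0, "C": 0, "G": 0}
    for base in sequence:
        # Skip anything that is not one of the four bases (assumed behavior).
        if base in counts:
            counts[base] += 1
    return counts


print(count_nucleotides("ATGCGTAC"))  # {'A': 2, 'T': 2, 'C': 2, 'G': 2}
```

The same function can be called on `realistic_sequence` once it has been generated.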

### GC Content Calculation

Define a function to calculate the GC content of a DNA sequence.
GC content represents the percentage of bases in a DNA sequence that are either Guanine (G) or Cytosine (C).
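
The definition above translates directly into code: count the `G` and `C` bases, divide by the total length, and multiply by 100. The function name `gc_content` and the choice to return `0.0` for an empty sequence are assumptions of this sketch.

```python
def gc_content(sequence):
    """Return the percentage of bases in `sequence` that are G or C."""
    if not sequence:
        # Avoid dividing by zero for an empty sequence (assumed behavior).
        return 0.0
    gc_count = sequence.count("G") + sequence.count("C")
    return 100 * gc_count / len(sequence)


print(gc_content("ATGCGTAC"))  # 50.0
```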

### Reverse Complement

Create a function to find the reverse complement of a DNA sequence.
The reverse complement is formed by reversing the sequence and replacing each base with its complementary base (A with T, C with G, and vice versa).
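
A minimal sketch, assuming the hypothetical name `reverse_complement` and an input that contains only uppercase `A`, `T`, `C`, and `G`; any other character raises a `KeyError` in this version.

```python
def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    # Walk the sequence from the end and swap each base for its partner.
    return "".join(complement[base] for base in reversed(sequence))


print(reverse_complement("ATGCGTAC"))  # GTACGCAT
```

Applying the function twice returns the original sequence, which is a quick way to check an implementation.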