# **Programming for Biologists 1: Your First Steps with Python**


Imagine you're in the lab, faced with stacks of data: hundreds of DNA sequences, thousands of gene expression values, or a massive list of protein IDs. Doing everything by hand would take forever, and mistakes are easy to make!

That's where **programming** comes in. It's like teaching your computer to be your super-fast, super-accurate lab assistant. You give it instructions, and it follows them perfectly, tirelessly, and quickly.

In this notebook, we'll start with **Python**, a very popular and easy-to-learn programming language. Biologists worldwide love Python because it's fantastic for handling biological data, crunching numbers, and automating tedious tasks.

Think of this Jupyter Notebook as your interactive lab notebook, where you can write code (your instructions) and see the results immediately.

---

## **1. Python's Basic Building Blocks: What are 'Words' and 'Numbers' to a Computer?**

Just like learning a new language, you start with basic words and sentences. In Python, we call these **data types**.

### **1.1 Text Information (Strings)**

In biology, we work with a lot of text: DNA sequences, protein names, gene descriptions. In Python, a sequence of characters (like "ATGC") is called a **string**.

You tell Python it's a string by putting **quotes** around it (either single `''` or double `""`).

In [None]:
# We can print (display) these strings
print("ATGCGTACGTACGTACGTACG")
print('BRCA1')

**Why this matters:** You'll use strings to store DNA, RNA, protein sequences, file names, sample IDs, and any descriptive text.

### **1.2 Number Information (Integers & Floats)**

Numbers are straightforward. Python has two main types for numbers:

* **Integers (`int`):** Whole numbers, like `10`, `100`, `-5`. (e.g., number of genes, count of mutations)
* **Floats (`float`):** Numbers with decimal points, like `3.14`, `0.5`, `100.0`. (e.g., pH values, concentrations, p-values)

In [None]:
# An integer (whole number)
number_of_genes = 20000

# A float (decimal number)
ph_value = 7.4

# We can print numbers too
print(number_of_genes)
print(ph_value)

**Why this matters:** You'll use numbers for counts, measurements, concentrations, statistical values, and much more.
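
If you are ever unsure which data type a value has, Python's built-in `type()` function will tell you. The values below are just illustrations taken from the examples above:

```python
# type() reports the data type Python has assigned to a value
print(type("ATGC"))   # a string (str)
print(type(20000))    # an integer (int)
print(type(7.4))      # a float
print(type(100.0))    # still a float: the decimal point is what matters
```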

### **1.3 Giving Names to Your Data (Variables)**

Instead of just writing `20000` or `"ATGC"`, we can give these pieces of information a *name*. These names are called **variables**.

Think of a variable as a labeled box where you can store a piece of information. You can change what's inside the box later.

In [None]:
# Store a DNA sequence in a variable called 'my_dna'
my_dna = "ATGCGTACGTACGTACGTACG"
print("My DNA sequence is:", my_dna)

# Store a count in a variable called 'gene_count'
gene_count = 500
print("The gene count is:", gene_count)

# You can even change what's in the box!
gene_count = gene_count + 100 # Add 100 to the existing count
print("New gene count:", gene_count)

**Why this matters:** Variables make your code readable and flexible. A descriptive name such as `patient_id` tells the reader what a value means, and you can reuse the same data many times without retyping it.

### **1.4 Basic Math and Text Tricks**

Python can do simple math and even some tricks with text!

* **Math operations:** `+` (add), `-` (subtract), `*` (multiply), `/` (divide)
* **Text operations:** `+` (joins text), `*` (repeats text)

In [None]:
# Math with numbers
total_cells = 100 + 50
print("Total cells:", total_cells)

percentage = (15 / 60) * 100
print("Percentage:", percentage)

# Combining (concatenating) text
first_part = "ATG"
second_part = "CGC"
full_sequence = first_part + second_part
print("Combined sequence:", full_sequence)

# Repeating text (e.g., for simple repeats in DNA)
repeat_unit = "GA"
long_repeat = repeat_unit * 5
print("Long repeat:", long_repeat)

**Your Turn! (Exercise 1)**

1.  Create a variable called `protein_sequence` and store a short protein sequence (e.g., `"ARNDCE"`) in it.
2.  Create a variable called `molecular_weight` and store a decimal number (e.g., `12345.67`) in it.
3.  Print both variables.
4.  Combine two short DNA fragments (e.g., `"GATT"` and `"ACCA"`) into a new variable called `exon_fragment` and print it.

In [None]:
# Write your code for Exercise 1 here!


### **Basic Arithmetic Operations**

| Operator | Description           | Example           |
|----------|-----------------------|-------------------|
| `+`      | Addition              | `3 + 2 → 5`       |
| `-`      | Subtraction           | `5 - 3 → 2`       |
| `*`      | Multiplication        | `4 * 2 → 8`       |
| `/`      | Division              | `8 / 2 → 4.0`     |
| `//`     | Floor Division        | `7 // 2 → 3`      |
| `%`      | Modulus (Remainder)   | `7 % 2 → 1`       |
| `**`     | Exponentiation (Power)| `3 ** 2 → 9`      |
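
The last three operators in the table are handy with sequence data. Here is a minimal sketch, assuming a made-up sequence length of 100 bases (all variable names are illustrative only):

```python
# Hypothetical example: splitting a 100-base sequence into codons
sequence_length = 100
codon_size = 3

full_codons = sequence_length // codon_size    # floor division: whole codons only
leftover_bases = sequence_length % codon_size  # modulus: bases left over at the end
possible_codons = 4 ** codon_size              # exponentiation: 4 bases at 3 positions

print("Full codons:", full_codons)
print("Leftover bases:", leftover_bases)
print("Possible codons:", possible_codons)
```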


---

## **2. Organizing Your Lab Samples: Lists and Dictionaries**

In the lab, you don't just have one sample; you have *many* samples, *many* genes, *many* experiments. Python has special ways to store collections of data.

### **2.1 Lists: Your Ordered Shopping List of Data**

A **list** is like an ordered shopping list. You can put many items into it, and they stay in the order you put them. You can also add or remove items from your list later.

Lists are defined using **square brackets `[]`** and items are separated by **commas `,`**.

In [None]:
# A list of gene names
gene_names = ["BRCA1", "TP53", "EGFR", "MYC"]
print("My gene list:", gene_names)

# A list of expression levels
expression_values = [12.5, 30.1, 5.8, 22.0]
print("Expression values:", expression_values)

In [None]:
# Accessing items in a list (remember, Python starts counting from 0!)
# The first item is at index 0, second at 1, and so on.
first_gene = gene_names[0] # This gets "BRCA1"
print("First gene:", first_gene)

third_expression = expression_values[2] # This gets 5.8
print("Third expression value:", third_expression)

# Adding items to your list
gene_names.append("VEGFA") # Adds "VEGFA" to the end of the list
print("Updated gene list:", gene_names)

# How many items are in my list?
num_genes = len(gene_names)
print("Number of genes in list:", num_genes)

**Why this matters:** Lists are perfect for storing sequences, file paths, sample IDs, experiment conditions, or any other series of data.
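
The examples above add items to a list. Removing items works in a similar way. This short sketch uses a made-up list called `candidate_genes` and the list methods `remove()` and `pop()`:

```python
# Hypothetical gene list, used only for illustration
candidate_genes = ["BRCA1", "TP53", "EGFR", "MYC"]

candidate_genes.remove("EGFR")     # remove an item by its value
last_gene = candidate_genes.pop()  # remove the last item and hand it back

print("Removed from the end:", last_gene)
print("Remaining genes:", candidate_genes)
```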

### **2.2 Dictionaries: Your Lab Notebook Index**

Imagine your lab notebook has an index. You look up a "gene name" (the key) and it tells you the "sequence" (the value).

A **dictionary** in Python works just like this. It stores information as **key-value pairs**. Each `key` is unique (like a gene name), and it points to a `value` (like its sequence).

Dictionaries are defined using **curly braces `{}`** with `key: value` pairs separated by commas.

In [None]:
# A dictionary mapping gene names (keys) to their sequences (values)
gene_sequences = {
    "BRCA1": "ATGCCGTA...",
    "TP53": "GGCATTCG...",
    "EGFR": "AAAGGGTT..."
}
print("My gene sequences dictionary:", gene_sequences)

In [None]:
# Getting the sequence for a specific gene
brca1_seq = gene_sequences["BRCA1"]
print("BRCA1 sequence:", brca1_seq)

# Adding a new gene to the dictionary
gene_sequences["MYC"] = "TTTCCCAT..."
print("Updated dictionary with MYC:", gene_sequences)

# Getting all the gene names (keys)
all_gene_names = gene_sequences.keys()
print("All gene names:", list(all_gene_names)) # Convert to list to see them clearly

**Why this matters:** Dictionaries are super useful for mapping IDs to data (e.g., patient ID to phenotype, gene name to expression level, protein PDB ID to structure file).
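
Because each key is unique, assigning to a key that already exists replaces its old value instead of creating a second entry. The sketch below uses made-up expression levels to show this, and also shows the `in` keyword for checking whether a key is present:

```python
# Hypothetical expression levels, used only for illustration
expression_levels = {"BRCA1": 12.5, "TP53": 30.1}

# Assigning to an existing key replaces the old value (keys stay unique)
expression_levels["TP53"] = 28.7
print("Updated TP53 level:", expression_levels["TP53"])

# 'in' answers True or False: is this key stored in the dictionary?
print("Is MYC stored?", "MYC" in expression_levels)
print("Number of genes stored:", len(expression_levels))
```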

**Your Turn! (Exercise 2)**

1.  Create a list called `cell_types` with at least three different cell types (e.g., "Neuron", "Macrophage", "Epithelial").
2.  Add a new cell type to your `cell_types` list.
3.  Print the second cell type in your list.
4.  Create a dictionary called `patient_ages` where keys are patient IDs (e.g., "P001", "P002") and values are their ages.
5.  Print the age of patient "P001" from your dictionary.

In [None]:
# Write your code for Exercise 2 here!



---

## **3. Smart Programs: Making Decisions and Doing Repetitive Tasks**

This is where programming becomes truly powerful. You can teach your computer to make decisions and to do things over and over again without getting bored!

### **3.1 The 'If This, Then That' Rule (Conditional Statements)**

Sometimes, your program needs to do different things based on different conditions. This is handled by `if`, `elif` (else if), and `else` statements.

In [None]:
dna_length = 250

if dna_length > 1000:
    print("This is a very long DNA sequence.")
elif dna_length > 200: # This is checked only if the first 'if' condition was False
    print("This is a medium-sized DNA sequence.")
else: # This runs if none of the above 'if' or 'elif' conditions were True
    print("This is a short DNA sequence.")

# You can also check whether two values are exactly equal
gene_status = "active"
if gene_status == "active": # == checks if two things are exactly equal
    print("Gene is switched on.")
else:
    print("Gene is switched off or unknown.")

**Why this matters:** You'll use `if/else` to filter data (e.g., "if gene expression is high, then...", "if mutation is present, then..."), validate inputs, and control program flow based on experimental conditions.
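
Conditions can also test that something is *not* the case, or combine several checks. A minimal sketch with made-up values, using `!=` (not equal), `not`, and `and`:

```python
# Hypothetical values, used only for illustration
gene_status = "inactive"
expression_level = 42.0
mutation_present = False

# != means "is not equal to"
if gene_status != "active":
    print("Gene is not active.")

# 'not' flips True to False (and False to True); 'and' needs both parts to be True
if expression_level > 20 and not mutation_present:
    label = "high expression, no mutation"
else:
    label = "other"
print("Sample label:", label)
```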

### **3.2 Doing Things for Every Single Sample (Loops)**

Imagine you have a list of 100 gene sequences, and you want to calculate the GC content for each one. Doing it individually would be tedious. That's where **loops** come in.

A `for` loop lets you go through each item in a list (or string, or other collection) and do something with it.

In [None]:
gene_list = ["BRCA1", "TP53", "EGFR", "MYC"]

print("--- Processing each gene ---")
for gene in gene_list: # 'gene' will be "BRCA1", then "TP53", etc., in each round
    print("Currently processing:", gene)
    # Here you would put code to do something with 'gene', like fetching its sequence

print("--- Done processing all genes ---")

# You can also loop through characters in a sequence
my_dna = "ATGCGT"
print("\n--- Counting bases in DNA ---")
a_count = 0
for base in my_dna:
    if base == 'A':
        a_count = a_count + 1 # Add 1 to a_count if the base is 'A'
    print(base) # Print each base as we go through it
print("Total 'A's:", a_count)

**Why this matters:** Loops are fundamental for processing large datasets. You'll loop through rows in a data file, sequences in a FASTA file, samples in an experiment, etc.
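
Loops can be placed inside other loops. As a sketch of the idea from the start of this section, the code below counts the G and C bases in each sequence of a short, made-up list (the names `sequences` and `gc_counts` are illustrative):

```python
# Hypothetical short sequences, used only for illustration
sequences = ["ATGC", "GGCC", "ATAT"]
gc_counts = []  # will hold one G+C count per sequence

for seq in sequences:                    # outer loop: one sequence at a time
    gc_count = 0
    for base in seq:                     # inner loop: one base at a time
        if base == "G" or base == "C":   # 'or' is True if either side is True
            gc_count = gc_count + 1
    gc_counts.append(gc_count)
    print(seq, "has", gc_count, "G/C bases out of", len(seq))
```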

**Your Turn! (Exercise 3)**

1.  You have a list of patient temperatures: `patient_temps = [37.2, 39.5, 36.8, 38.1]`.
2.  Use a `for` loop to go through each temperature.
3.  Inside the loop, use an `if` statement to check:
    * If the temperature is greater than `38.0`, print: "Patient has a fever: \[temperature]"
    * Otherwise, print: "Patient temperature is normal: \[temperature]"

In [None]:
# Write your code for Exercise 3 here!


---

## **4. Building Your Own Lab Tools (Functions)**

As you write more code, you'll find yourself doing the same set of steps over and over again. Instead of writing them out every time, you can package them into a reusable "tool" called a **function**.

Think of a function like a pre-made lab protocol (e.g., a PCR protocol). You give it some ingredients (inputs), it follows the steps, and it gives you a final product (output).

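You define a function using the `def` keyword, give it a name, list its inputs (called parameters) in parentheses, and end the line with a colon. The indented lines underneath are the steps the function carries out.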

In [None]:
# A simple function to say hello
def greet_person(name):
    """This function greets a person by name.""" # This is a "docstring" - explains what the function does
    print("Hello,", name + "!")

# Now, use our new tool!
greet_person("Dr. Smith")
greet_person("Lab Assistant Bot")

In [None]:
# A function to calculate GC content
def calculate_gc_content(dna_sequence):
    """Calculates the percentage of G and C bases in a DNA sequence."""
    g_count = dna_sequence.count('G') # Python has built-in ways to count characters!
    c_count = dna_sequence.count('C')
    total_bases = len(dna_sequence) # len() tells us the length of a string/list

    if total_bases == 0: # Avoid dividing by zero!
        return 0.0 # Return 0% if the sequence is empty

    gc_percent = ((g_count + c_count) / total_bases) * 100
    return gc_percent # This is what the function "gives back" as its result

# Let's test our GC content tool
seq1 = "ATGCGCGCAT"
gc1 = calculate_gc_content(seq1)
print(f"GC content of '{seq1}' is: {gc1:.2f}%") # The .2f means format to 2 decimal places

seq2 = "AAAAAA"
gc2 = calculate_gc_content(seq2)
print(f"GC content of '{seq2}' is: {gc2:.2f}%")

**Why this matters:** Functions are crucial for organizing your code, making it reusable, and avoiding repetitive work. You'll write functions for things like parsing sequence headers, calculating reverse complements, or analyzing specific regions of a genome.
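
As one possible sketch of the reverse-complement idea, the function below combines a dictionary, a loop, and `return`. It assumes the input contains only uppercase A, T, G, and C. The function name and the use of the built-in `reversed()` are choices made for this illustration:

```python
def reverse_complement(dna_sequence):
    """Return the reverse complement of an uppercase DNA sequence (A, T, G, C only)."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    result = ""
    # reversed() walks through the sequence from the last base to the first
    for base in reversed(dna_sequence):
        result = result + complement[base]
    return result

print(reverse_complement("ATGC"))
```

Any other character (such as `N` or a lowercase letter) would cause a `KeyError` here, so real data may need extra handling.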

**Your Turn! (Exercise 4)**

1.  Write a function called `celsius_to_fahrenheit` that takes one input: `celsius_temp`.
2.  Inside the function, calculate Fahrenheit using the formula: `F = (C * 9/5) + 32`.
3.  The function should `return` the calculated Fahrenheit temperature.
4.  Call your function with a Celsius temperature (e.g., `25`) and print the result.

In [None]:
# Write your code for Exercise 4 here!


---

## **5. Working with Your Lab Data Files (File Input/Output)**

Your data isn't usually typed directly into Python; it comes from files. Python is excellent at reading from and writing to files.

### **5.1 Opening Your Lab Records (Reading Files)**

To read a file, you "open" it. The `with open(...)` structure is the safest way to do this, as it automatically closes the file when you're done, even if errors happen.

In [None]:
# --- Create a dummy file ---
# In a real scenario, you'd already have a 'dna_data.txt' file
with open("dna_data.txt", "w") as f:
    f.write(">gene_A\nATGCATGCATGC\n")
    f.write(">gene_B\nGGCCGGCCGGCC\n")
# ------------------------------------------------------------------------------------------

print("--- Reading DNA data from file ---")
# Open the file for reading ('r' means read mode)
with open("dna_data.txt", "r") as dna_file:
    # Read each line from the file
    for line in dna_file:
        # Each 'line' will still have the newline character at the end ('\n')
        # We often strip it off for cleaner data
        cleaned_line = line.strip()
        print("Read line:", cleaned_line)

# Let's say you have a FASTA file (common biological sequence format)
# It has lines starting with '>' (header) and then the sequence
# We can combine reading with our previous knowledge:


In [None]:
# --- Create a dummy FASTA file ---
with open("genes.fasta", "w") as f:
    f.write(">gene1_description\nATGCATGC\n")
    f.write(">gene2_description\nGGCCGGCC\n")
    f.write(">gene3_description\nTTTAAA\n")
# ----------------------------------

print("\n--- Processing a simple FASTA file ---")
gene_sequences_dict = {} # We'll store our sequences in a dictionary
current_gene_id = ""

with open("genes.fasta", "r") as fasta_file:
    for line in fasta_file:
        cleaned_line = line.strip()
        if cleaned_line.startswith('>'): # If it's a header line
            current_gene_id = cleaned_line[1:] # Remove the '>'
            gene_sequences_dict[current_gene_id] = "" # Initialize empty string for this gene
        else: # If it's a sequence line
            gene_sequences_dict[current_gene_id] += cleaned_line # Add sequence to current gene

print("Loaded sequences:")
for gene_id, sequence in gene_sequences_dict.items():
    print(f"Gene ID: {gene_id}, Sequence: {sequence}")

**Why this matters:** This is how you'll get data from public databases, sequencing results, or any other external files into your Python program for analysis.
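
Real FASTA files often wrap one sequence over several lines. The `+=` step in the loop above already joins those lines together. The sketch below repeats the same approach on a made-up file with a wrapped sequence, and also skips blank lines and any text that appears before the first header. The file name and its contents are invented for this example, and the file is written to a temporary folder:

```python
import os
import tempfile

# Hypothetical FASTA content: gene1 is wrapped over two lines, followed by a blank line
fasta_text = ">gene1\nATGC\nATGC\n\n>gene2\nGGCC\n"

# Write it to a temporary folder so no existing file is touched
fasta_path = os.path.join(tempfile.mkdtemp(), "wrapped_example.fasta")
with open(fasta_path, "w") as f:
    f.write(fasta_text)

sequences = {}
current_id = ""
with open(fasta_path, "r") as fasta_file:
    for line in fasta_file:
        cleaned_line = line.strip()
        if cleaned_line.startswith(">"):
            current_id = cleaned_line[1:]
            sequences[current_id] = ""
        elif cleaned_line and current_id:  # skip blank lines and text before the first header
            sequences[current_id] += cleaned_line

for gene_id, sequence in sequences.items():
    print(gene_id, "length:", len(sequence))
```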

### **5.2 Writing Files**

After you process data, you'll often want to save your results to a new file.

To write a file, open it in write mode (`'w'`), which creates the file or overwrites it if it already exists, or in append mode (`'a'`), which adds to the end of an existing file without deleting its content.

In [None]:
# Each entry matches the "Gene ID,GC Content" header written below
results_to_save = [
    "GeneX,65%",
    "GeneY,40%",
    "GeneZ,52%"
]

print("--- Writing analysis results to a file ---")
with open("analysis_results.txt", "w") as outfile: # 'w' will create/overwrite the file
    outfile.write("Gene ID,GC Content\n") # Write a header line
    for result_line in results_to_save:
        outfile.write(result_line + "\n") # Write each result, add a newline character

print("Results saved to analysis_results.txt")

# You can check the file's content after running this cell by going to your Jupyter/Colab file browser,
# or by reading it back into the notebook:
print("\n--- Checking the written file content ---")
with open("analysis_results.txt", "r") as check_file:
    for line in check_file:
        print(line.strip())

**Why this matters:** You'll use this to save filtered gene lists, computed statistics, formatted output for other software, or custom reports.
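
To see the difference between write mode and append mode, here is a small sketch. The file name `run_log.txt` is made up, and the file is created in a temporary folder so nothing on your computer gets overwritten:

```python
import os
import tempfile

# Hypothetical log file in a temporary folder
log_path = os.path.join(tempfile.mkdtemp(), "run_log.txt")

with open(log_path, "w") as log_file:  # 'w' starts a fresh file
    log_file.write("Run 1 finished\n")

with open(log_path, "a") as log_file:  # 'a' adds to the end, keeping earlier lines
    log_file.write("Run 2 finished\n")

# Read the file back to confirm both lines are there
log_lines = []
with open(log_path, "r") as log_file:
    for line in log_file:
        log_lines.append(line.strip())
print(log_lines)
```

If the second `open` had used `'w'` instead of `'a'`, the first line would have been lost.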

**Your Turn! (Exercise 5)**

1.  Imagine you have a list of gene names that passed a filter: `filtered_genes = ["ABC1", "XYZ2", "DEF3"]`.
2.  Write these gene names, one per line, into a new file called `high_confidence_genes.txt`.

In [None]:
# Write your code for Exercise 5 here!



---

## **6. When Things Go Wrong: Basic Error Handling**

Even the best experiments (and programs!) can have unexpected issues. Python tries to tell you when something is wrong with an "error message." You can make your programs more robust by anticipating common problems.

A common way to handle errors is using `try` and `except`.

* **`try`:** "Try to run this code."
* **`except`:** "IF an error happens during the `try` block, THEN do this instead of crashing."

In [None]:
# Example: Trying to divide by zero (a common error)
numerator = 10
denominator = 0 # Uh oh!

try:
    result = numerator / denominator
    print("Result:", result)
except ZeroDivisionError: # Catch a specific type of error
    print("Oops! You tried to divide by zero. That's not allowed in math.")
except Exception as e: # Catch any other error that might occur
    print(f"An unexpected error occurred: {e}")

print("Program continued gracefully!")

# Example: Trying to open a file that doesn't exist
file_name = "non_existent_file.txt"
try:
    with open(file_name, "r") as f:
        content = f.read()
        print("File content:", content)
except FileNotFoundError:
    print(f"Error: The file '{file_name}' was not found. Please check the name and path.")
except Exception as e:
    print(f"An unexpected error occurred while opening the file: {e}")

**Why this matters:** Robust scripts can handle missing files, malformed data, or unexpected values without crashing, making them more reliable for real-world lab data.
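
Malformed data can be handled in the same way. In this sketch, a made-up list of raw text values contains one entry that is not a number. `float()` raises a `ValueError` for it, and the `except` block sets it aside instead of stopping the whole loop:

```python
# Hypothetical raw measurements, as they might appear in a text file
raw_values = ["7.4", "6.9", "n/a", "7.1"]
clean_values = []
skipped = []

for raw in raw_values:
    try:
        clean_values.append(float(raw))  # float() raises ValueError for text like "n/a"
    except ValueError:
        skipped.append(raw)              # keep a record of what could not be converted

print("Usable values:", clean_values)
print("Skipped entries:", skipped)
```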

---