# **Programming for Biologists: An Introduction to Biopython**

**Welcome, Biopython explorers!**

You've now mastered the basics of Python, and even delved into powerful libraries like NumPy, Pandas, and Matplotlib for general data analysis and visualization. But how do you specifically handle the unique challenges of biological data – DNA, RNA, protein sequences, complex file formats like FASTA and GenBank, and interacting with online databases?

Enter **Biopython!**

Biopython is a comprehensive collection of Python tools specifically designed for **computational biology and bioinformatics**. It provides easy-to-use interfaces to common biological data formats, command-line tools, and online databases, saving you immense time and effort.

Think of Biopython as your specialized bioinformatics lab equipment – pre-built, optimized, and ready to tackle tasks like:
* Manipulating DNA, RNA, and protein sequences.
* Parsing complex sequence files (FASTA, GenBank, PDB, etc.).
* Performing sequence alignments.
* Interacting with NCBI databases (GenBank, PubMed, BLAST).
* Working with phylogenetic trees, population genetics, and much more.

In this notebook, we'll focus on the fundamental aspects of Biopython: working with sequences and parsing common sequence files.

---

## **1. Getting Started: Installation and Core Modules**

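Before you can use Biopython, you need to ensure it's installed. If you're using a fresh Python environment or Colab, run the following in a notebook cell. The leading `!` tells the notebook to pass the command to the shell, so drop it if you are typing in a regular terminal. You only need to do this once per environment: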

```bash
!pip install biopython
```

Then, we'll import the modules we'll be using.

In [None]:
# Run this cell if Biopython is not installed in your environment (e.g., in Google Colab or new setup)
# !pip install biopython

from Bio.Seq import Seq           # For creating and manipulating sequence objects
from Bio.SeqRecord import SeqRecord # For storing sequences with metadata (ID, description)
from Bio import SeqIO             # For parsing sequence files (FASTA, GenBank, etc.)

print("Biopython modules imported successfully!")

---

## **2. The `Seq` Object: Your Biological String**

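At the heart of Biopython is the `Seq` object. A standard Python string (`"ATGC"`) can store a sequence, but a `Seq` object adds biology-aware methods on top of the usual string behavior.

It can transcribe, translate, or reverse-complement itself directly, which a normal string cannot. Note that a `Seq` does not validate its letters or record whether it holds DNA, RNA, or protein; it is up to you to call methods that make sense for your data.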
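
To see what `Seq` saves you, here is a minimal sketch of a reverse complement written with plain Python strings only. The helper name `reverse_complement_plain` is made up for this illustration, and it handles only the unambiguous DNA letters A, C, G, and T.

```python
# Reverse complement using only built-in string tools (no Biopython).
# reverse_complement_plain is a hypothetical helper for illustration.
# Note: str.translate() here is Python's character-mapping method,
# not biological translation.
COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_plain(dna: str) -> str:
    """Complement each base, then reverse the string."""
    return dna.translate(COMPLEMENT_TABLE)[::-1]


plain_result = reverse_complement_plain("ATGCGTAGCTA")
print(plain_result)
```

With a `Seq` object the same operation is a single call to `reverse_complement()`, which also handles ambiguity codes such as `N`.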

### **2.1 Creating `Seq` Objects**

You create a `Seq` object by passing a string to `Seq()`.

In [None]:
# Creating DNA, RNA, and protein sequences
dna_seq = Seq("ATGCGTACGTAGCTAGCTAGCTAGCTACGATGCATGCA")
rna_seq = Seq("AUGCAGUCAGUCAGUCAUGCAUGCUGA")
protein_seq = Seq("ARNDCEQGHI")

print(f"DNA Sequence: {dna_seq}, Type: {type(dna_seq)}")
print(f"RNA Sequence: {rna_seq}, Type: {type(rna_seq)}")
print(f"Protein Sequence: {protein_seq}, Type: {type(protein_seq)}")

# Note: Seq objects no longer carry an alphabet. The Bio.Alphabet module was removed in
# Biopython 1.78, so Seq("ATGC", DNAAlphabet()) fails on current releases.
# To record the molecule type, set it on a SeqRecord instead, for example:
# record.annotations["molecule_type"] = "DNA"

### **2.2 Basic `Seq` Operations**

`Seq` objects behave much like strings for basic operations but have powerful biological methods.

In [None]:
my_dna = Seq("ATGCGTAGCTA")

print(f"Original DNA: {my_dna}")

# Length
print(f"Length: {len(my_dna)}")

# Slicing
print(f"First 3 bases: {my_dna[0:3]}")

# Concatenation
another_dna = Seq("TCGA")
combined_dna = my_dna + another_dna
print(f"Combined DNA: {combined_dna}")

# Counting bases
print(f"Number of 'A's: {my_dna.count('A')}")
print(f"Number of 'G's: {my_dna.count('G')}")
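
A few more string-like operations are often handy. The sketch below uses the same `my_dna` sequence and computes GC content by hand from `count()`. Biopython also ships GC helpers in `Bio.SeqUtils`, but their names have changed between releases, so the manual version is shown here.

```python
from Bio.Seq import Seq

my_dna = Seq("ATGCGTAGCTA")

# Convert to a plain Python string when another library expects one
as_text = str(my_dna)

# find() returns the index of the first match, or -1 if absent
first_gta = my_dna.find("GTA")

# Membership tests work as they do for strings
has_start_codon = "ATG" in my_dna

# GC content as a fraction of the sequence length
gc_content = (my_dna.count("G") + my_dna.count("C")) / len(my_dna)

print(f"As string: {as_text}")
print(f"First 'GTA' at index: {first_gta}")
print(f"Contains ATG: {has_start_codon}")
print(f"GC content: {gc_content:.2f}")
```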

### **2.3 Biological Operations with `Seq`**

This is where `Seq` objects truly differ from plain strings.

In [None]:
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
print(f"Coding DNA: {coding_dna}")

# 1. Transcription (DNA to RNA)
messenger_rna = coding_dna.transcribe()
print(f"mRNA (Transcription): {messenger_rna}")

# 2. Reverse Complement (important for working with both strands of DNA)
reverse_comp_dna = coding_dna.reverse_complement()
print(f"Reverse Complement: {reverse_comp_dna}")

# 3. Translation (mRNA to Protein)
# By default, translate() uses the standard genetic code and translates the whole sequence.
# Stop codons appear as '*' in the output; pass to_stop=True to stop at the first one instead.
amino_acids = messenger_rna.translate()
print(f"Amino Acids (Translation): {amino_acids}")

# You can also translate directly from DNA
protein_from_dna = coding_dna.translate()
print(f"Protein from DNA (Translation): {protein_from_dna}")

# Translate using a different genetic code (e.g., NCBI table 4, the mold/protozoan
# mitochondrial code, in which TGA encodes tryptophan instead of a stop)
# print(f"Protein (table 4): {coding_dna.translate(table=4)}")
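
The `translate()` method takes optional arguments that change how stop codons and genetic codes are handled. Here is a short sketch using the same `coding_dna` sequence. Both `to_stop` and `table` are existing `translate()` parameters, and table 4 is used purely as an example of a non-standard code.

```python
from Bio.Seq import Seq

coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Default: translate the whole sequence; '*' marks each stop codon
full_protein = coding_dna.translate()

# to_stop=True ends at the first in-frame stop codon and omits the '*'
first_orf = coding_dna.translate(to_stop=True)

# table=4 selects NCBI genetic code 4, where TGA encodes tryptophan (W)
alt_code_protein = coding_dna.translate(table=4)

print(f"Full translation: {full_protein}")
print(f"Up to first stop: {first_orf}")
print(f"Genetic code 4:   {alt_code_protein}")
```

Comparing the first and third results shows why choosing the right table matters: the same DNA yields a 7-residue peptide under the standard code but reads through the `TGA` codon under table 4.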

**Your Turn! (Seq Exercise)**

1.  Create a `Seq` object for the following DNA sequence: `"GATGGAACTGA"`.
2.  Get its reverse complement and print it.
3.  Transcribe the original DNA sequence to RNA and print it.
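4.  Translate the resulting RNA sequence into a protein sequence and print it. (The sequence has 11 bases, which is not a multiple of three, so expect Biopython to warn about a partial codon at the end.)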

In [None]:
# Write your code for Seq Exercise here!


---

## **3. The `SeqRecord` Object: Sequence with Identity**

While `Seq` handles the sequence itself, real-world biological data often comes with important metadata: an ID, a name, a description, and sometimes features (like gene locations). The `SeqRecord` object bundles a `Seq` object with all this descriptive information.

Think of a `SeqRecord` as a complete entry from a FASTA or GenBank file.

In [None]:
# Create a Seq object first
my_gene_seq = Seq("ATGCATGCAACTGACGTAGCTAGCTAGC")

# Create a SeqRecord from the Seq object and add metadata
record = SeqRecord(
    my_gene_seq,
    id="gene_001",
    name="gene_A",
    description="Hypothetical protein from E. coli K12"
)

print(f"Record ID: {record.id}")
print(f"Record Name: {record.name}")
print(f"Record Description: {record.description}")
print(f"Record Sequence: {record.seq}")
print(f"Type of record.seq: {type(record.seq)}")

# You can also add annotations (key-value pairs) and features (more complex biological annotations)
record.annotations["organism"] = "Escherichia coli"
record.annotations["source"] = "GenBank"
print(f"\nRecord annotations: {record.annotations}")

# Access a specific annotation
print(f"Organism: {record.annotations['organism']}")
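
Features describe a region of the sequence, such as a coding region. The sketch below attaches one feature to a record like the one above. The coordinates and the `gene` qualifier are invented for illustration and do not describe a real E. coli gene.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

record = SeqRecord(
    Seq("ATGCATGCAACTGACGTAGCTAGCTAGC"),
    id="gene_001",
    name="gene_A",
    description="Hypothetical protein from E. coli K12",
)

# A feature covering the first 9 bases on the forward strand.
# Coordinates are 0-based and end-exclusive, like Python slices.
cds_feature = SeqFeature(
    FeatureLocation(0, 9, strand=+1),
    type="CDS",
    qualifiers={"gene": ["gene_A"]},  # made-up qualifier for illustration
)
record.features.append(cds_feature)

# extract() pulls out the part of the sequence that the feature covers
cds_seq = cds_feature.extract(record.seq)
print(f"Feature type: {cds_feature.type}")
print(f"Feature location: {cds_feature.location}")
print(f"Feature sequence: {cds_seq}")
```

Records parsed from GenBank files arrive with `record.features` already populated, so the same `extract()` call works on them.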

**Your Turn! (SeqRecord Exercise)**

1.  Create a `Seq` object for the protein sequence: `"MKLVLALLS"`.
2.  Create a `SeqRecord` object for this protein with:
    * `id`: `"PROT_007"`
    * `name`: `"signal_peptide_7"`
    * `description`: `"A putative signal peptide from a viral protein"`
3.  Add an annotation to this `SeqRecord` with `"function"` as the key and `"cellular trafficking"` as the value.
4.  Print the `id`, `description`, and the `function` annotation of your `SeqRecord`.

In [None]:
# Write your code for SeqRecord Exercise here!


---

## **4. Parsing Sequence Files with `Bio.SeqIO`**

This is perhaps the most heavily used part of Biopython. `Bio.SeqIO` provides a simple way to read and write sequence files in various formats (FASTA, GenBank, FASTQ, EMBL, Nexus, etc.).

### **4.1 Creating a Sample FASTA File for Parsing**

To demonstrate `SeqIO`, let's create a temporary FASTA file just like we did in previous notebooks.

In [None]:
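# Define the content for our sample FASTA file.
# The backslash after the opening quotes keeps the file from starting with a blank line,
# so the first line is a '>' header as the FASTA format expects.
fasta_content = """\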
>gene_alpha | E. coli gene for hypothetical protein
ATGCGTACGTAGCTAGCTAGCTAGCTACGATGCATGCA
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTACGTACG
>gene_beta | Human ribosomal RNA fragment
AGUUCAGUCAGUCAGUCAUGCAUGCUGA
>gene_gamma | Viral spike protein
ATGGCAAGGTTTCAGGTAACCAAA
GGGTTTCCCG
"""

# Name of our sample FASTA file
sample_fasta_file = "sample_bioseq.fasta"

# Write the content to the file
with open(sample_fasta_file, "w") as f:
    f.write(fasta_content)

print(f"Successfully created '{sample_fasta_file}' for demonstration.")

### **4.2 Reading Records from a FASTA File**

The `SeqIO.parse()` function is your primary tool. It takes a file handle (or a filename) and a format string (e.g., `'fasta'`, `'genbank'`) and returns an *iterator* of `SeqRecord` objects. An iterator gives you one record at a time instead of loading the whole file, which keeps memory use low for very large files.

You can loop through this iterator to process each record.

In [None]:
print(f"--- Parsing '{sample_fasta_file}' ---")

parsed_records = [] # We'll store all records in a list for later access

with open(sample_fasta_file, "r") as handle:
    # SeqIO.parse returns an iterator of SeqRecord objects
    for record in SeqIO.parse(handle, "fasta"):
        print(f"\nRecord ID: {record.id}")
        print(f"Record Name: {record.name}")
        print(f"Record Description: {record.description}")
        print(f"Sequence Length: {len(record.seq)}")
        print(f"Sequence (first 20 bases): {record.seq[:20]}...")
        
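        # You can now perform biological operations on record.seq.
        # Seq objects have no .alphabet attribute in current Biopython (removed in 1.78),
        # so we check the letters ourselves before treating the record as nucleotide.
        if len(record.seq) > 0 and set(str(record.seq).upper()) <= set("ACGTUN"):
            print(f"  Reverse Complement: {record.seq.reverse_complement()[:20]}...")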
            
        parsed_records.append(record)

print(f"\nTotal records parsed: {len(parsed_records)}")
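
When you need lookup by ID rather than a single pass, `SeqIO.to_dict()` can load the records into a dictionary. The sketch below parses from an in-memory `StringIO` handle so it runs on its own, and the two records are cut-down versions of the sample file. This approach holds every record in memory, so it suits small files.

```python
from io import StringIO

from Bio import SeqIO

fasta_text = """\
>gene_alpha | E. coli gene for hypothetical protein
ATGCGTACGTAGCTAGC
>gene_gamma | Viral spike protein
ATGGCAAGGTTTCAGG
"""

# to_dict() consumes the iterator and keys each record by its id
records_by_id = SeqIO.to_dict(SeqIO.parse(StringIO(fasta_text), "fasta"))

gamma = records_by_id["gene_gamma"]

# For FASTA, id is the first word of the header line;
# description is the whole header line without the leading '>'
print(f"IDs: {list(records_by_id)}")
print(f"gamma.id: {gamma.id}")
print(f"gamma.description: {gamma.description}")
print(f"gamma length: {len(gamma.seq)}")
```

The printed description shows that the ID is repeated at the start of `record.description`, which is worth remembering before you edit either field.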

### **4.3 Writing Records to a New FASTA File**

Just as easy as reading, you can write `SeqRecord` objects to a file using `SeqIO.write()`.

In [None]:
output_fasta_file = "modified_sequences.fasta"

# Let's modify a record before writing it out (e.g., take reverse complement of first gene)
if len(parsed_records) > 0:
    first_record = parsed_records[0]
    original_seq = first_record.seq
    first_record.seq = original_seq.reverse_complement() # Change the sequence
    first_record.description = f"Reverse complement of {first_record.description}" # Update description
    first_record.id = first_record.id + "_RC"
    first_record.name = first_record.name + "_RC"

    # Write all records (including the modified first one) to a new file
    with open(output_fasta_file, "w") as output_handle:
        SeqIO.write(parsed_records, output_handle, "fasta")
    print(f"\nSuccessfully wrote {len(parsed_records)} records to '{output_fasta_file}'.")
    
    # Verify by reading the new file
    print(f"\nContent of '{output_fasta_file}':")
    with open(output_fasta_file, "r") as handle:
        for record in SeqIO.parse(handle, "fasta"):
            print(f"ID: {record.id}, Length: {len(record.seq)}, Desc: {record.description[:50]}...")
else:
    print("No records to write. Please ensure parsing step was successful.")
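
`SeqIO.write()` returns the number of records it wrote, which makes for a cheap sanity check. Here is a minimal sketch that writes to an in-memory handle instead of a file so the result can be inspected directly. The record IDs and sequences are invented.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Two made-up records for illustration
records = [
    SeqRecord(Seq("ATGGCCATTGTA"), id="demo_1", description="first demo record"),
    SeqRecord(Seq("GGGTTTCCC"), id="demo_2", description="second demo record"),
]

buffer = StringIO()
written = SeqIO.write(records, buffer, "fasta")  # returns the record count
fasta_output = buffer.getvalue()

print(f"Records written: {written}")
print(fasta_output)
```

Each header line in the output is built from the record's `id` followed by its `description`.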

**Your Turn! (SeqIO Exercise)**

1.  Create a new dummy FASTA file named `my_proteins.fasta` with at least two protein sequences. Example:
    ```
    >protein_A | Hypothetical protein
    MKNFTKGAIL
    >protein_B | Structural protein
    MVLSPADKTN
    ```
2.  Parse this `my_proteins.fasta` file using `SeqIO.parse()`.
3.  For each protein record:
    * Print its ID.
    * Print its sequence.
    * Print the length of its sequence.
4.  Write all these protein records to a new FASTA file named `processed_proteins.fasta`.

In [None]:
# Write your code for SeqIO Exercise here!

# 1. Create a dummy protein FASTA file


# 2. Parse the protein FASTA file


# 4. Write records to a new FASTA file


# Optional: Verify content of the new file


---

The `Seq`, `SeqRecord`, and `SeqIO` tools covered here form the core toolkit for almost any bioinformatics task involving sequence data. Biopython is vast and offers modules for many other areas, including:

* **`Bio.Blast`**: Running and parsing BLAST results.
* **`Bio.Entrez`**: Interacting with NCBI databases (GenBank, PubMed, SRA, etc.) to fetch data programmatically.
* **`Bio.Align`**: Performing sequence alignments. Its `PairwiseAligner` class replaces the older `Bio.pairwise2` module, which is now deprecated.
* **`Bio.PDB`**: Working with protein structures.
* **`Bio.Phylo`**: Constructing and analyzing phylogenetic trees.