# Codon Optimization & Deoptimization

This notebook demonstrates how to use the `tasep_models` library to:

1. **Optimize** a coding sequence by replacing each codon with the most frequently used synonymous codon in the human genome
2. **Deoptimize** a sequence by using the least frequently used synonymous codons
3. **Calculate the Codon Adaptation Index (CAI)** as a proxy for translation efficiency
4. **Visualize codon usage** along the gene using sliding window analysis

---

## Background

The **Codon Adaptation Index (CAI)** measures how closely the codon usage of a gene matches the codon usage of highly expressed genes. A higher CAI means the gene uses more of the codons favored in highly expressed genes, which is generally associated with faster and more efficient translation.

- **CAI = 1.0**: Optimal codon usage (all codons are the most frequently used)
- **CAI → 0**: Poor codon usage (rare codons are used)
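
As a minimal sketch of the definition above (not the `tasep_models` implementation): CAI is commonly computed as the geometric mean of per-codon relative adaptiveness weights, where each weight is a codon's frequency divided by the frequency of the most-used synonymous codon. The `TOY_WEIGHTS` table and the `toy_cai` helper are hypothetical, and the weights are made-up numbers for illustration only.

```python
import math

# Hypothetical relative adaptiveness weights (w = freq / max freq among synonyms).
# The numbers are made up for illustration and are not real human codon usage.
TOY_WEIGHTS = {
    "AAG": 1.0, "AAA": 0.75,   # Lys
    "GAG": 1.0, "GAA": 0.5,    # Glu
}

def toy_cai(seq, weights=TOY_WEIGHTS):
    """Geometric mean of the weights of all in-frame codons found in `weights`."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")

best = toy_cai("AAGGAGAAG")    # every codon has w = 1.0
mixed = toy_cai("AAAGAAAAG")   # weights 0.75, 0.5, 1.0
print(f"best={best:.3f} mixed={mixed:.3f}")
```

Averaging in log space gives the geometric mean, so a single very rare codon pulls the score down more than it would with an arithmetic mean.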

### Applications
- **Optimize**: Maximize protein expression by using frequently used codons
- **Deoptimize**: Reduce translation speed for cotranslational folding studies, attenuated vaccines, etc.

## 1. Setup and Imports

In [None]:
# Install the library if needed:
#   pip install tasep_models
# OR for development:
#   pip install -e /path/to/tasep_models

import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

# Import tasep_models functions
import tasep_models as tm
from tasep_models import (
    read_sequence,
    compute_CAI,
    sliding_window_cai,
    plot_sliding_window_cai,
    optimize_sequence,
    deoptimize_sequence,
    calculate_codon_usage,
    plot_codon_usage_grouped,
    plot_plasmid,
    U_TAG,
    HA_TAG,
)

print(f"tasep_models version: {tm.__version__}")

## 2. Load a Gene Sequence

We'll load a plasmid sequence from a SnapGene `.dna` file. The `read_sequence` function extracts the protein-coding region and identifies tag positions.

In [None]:
# Path to gene sequence file (SnapGene .dna format)
# Available sequences in data/human_genome/gene_sequences/utag_project/:
#   - pNZ208(pUB-24xUTagFullLength-KDM5B-MS2).dna  (U-TAG)
#   - pNZ266(pUB-24xGCN4-KDM5B-MS2).dna           (GCN4/SunTag)
#   - pNZ267 (pUB-24xALFAtag-KDM5B-MS2).dna       (ALFA tag)

# Set the path to your gene sequence
data_dir = Path('..') / 'data' / 'human_genome' / 'gene_sequences' / 'utag_project'
dna_file = data_dir / 'pNZ208(pUB-24xUTagFullLength-KDM5B-MS2).dna'

# Read the sequence with U-TAG detection
tag_sequence = U_TAG
protein, rna, dna, indexes_tags, indexes_pauses, seq_record, graphic_features = read_sequence(
    seq=dna_file,
    min_protein_length=50,
    TAG=[tag_sequence]
)

print(f"Gene file: {dna_file.name}")
print(f"RNA sequence length: {len(rna)} nucleotides")
print(f"Protein length: {len(protein)} amino acids")
print(f"Number of codons: {len(rna) // 3}")
print(f"Tag positions: {indexes_tags[0] if indexes_tags else 'None'}")

In [None]:
# Visualize the plasmid map
plasmid_fig = plot_plasmid(seq_record, graphic_features, figure_width=20, figure_height=4)

## 3. Calculate Original CAI

The **Codon Adaptation Index (CAI)** measures how well the codon usage of a gene matches that of the organism's highly expressed genes.

In [None]:
# Calculate CAI for the original sequence
original_cai = compute_CAI(rna)
print(f"Original sequence CAI: {original_cai:.4f}")

## 4. Optimize the Sequence

**Optimization** replaces each codon with the most frequently used synonymous codon in the human genome. This can increase translation speed and protein expression levels.
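
The replacement rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the code behind `optimize_sequence`: the `TOY_USAGE` table covers only two amino acids, its frequencies are placeholders, and `toy_optimize` is a hypothetical helper.

```python
# Hypothetical usage table: amino acid -> {codon: relative frequency}.
# Frequencies are illustrative placeholders, not real human values.
TOY_USAGE = {
    "K": {"AAG": 0.6, "AAA": 0.4},
    "E": {"GAG": 0.6, "GAA": 0.4},
}
# Reverse lookup: codon -> amino acid.
CODON_TO_AA = {c: aa for aa, codons in TOY_USAGE.items() for c in codons}

def toy_optimize(seq, usage=TOY_USAGE):
    """Swap each known codon for its most frequent synonym; keep unknown codons."""
    out = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        aa = CODON_TO_AA.get(codon)
        out.append(max(usage[aa], key=usage[aa].get) if aa else codon)
    return "".join(out)

toy_optimized = toy_optimize("AAAGAAAAG")
print(toy_optimized)
```

Because every swap is between synonymous codons, the sequence length and the encoded protein stay the same.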

In [None]:
# Optimize the sequence
optimized_rna = optimize_sequence(rna)

# Calculate CAI for optimized sequence
optimized_cai = compute_CAI(optimized_rna)

print(f"Original CAI:  {original_cai:.4f}")
print(f"Optimized CAI: {optimized_cai:.4f}")
print(f"CAI improvement: {(optimized_cai - original_cai) / original_cai * 100:.1f}%")
print(f"\nOptimized sequence length: {len(optimized_rna)} nt (same as original: {len(optimized_rna) == len(rna)})")

## 5. Deoptimize the Sequence

**Deoptimization** replaces each codon with the least frequently used synonymous codon. This can:
- Slow down translation to allow cotranslational folding
- Create attenuated vaccines
- Study ribosome pausing effects
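
A small sketch of the same idea in the opposite direction, with a check that the encoded protein is unchanged. It assumes a hypothetical two-amino-acid table (`TOY_USAGE`, placeholder frequencies) and hypothetical helpers `toy_translate` and `toy_deoptimize`. It does not reflect how `deoptimize_sequence` is implemented.

```python
# Hypothetical usage table (illustrative frequencies, not real human values).
TOY_USAGE = {
    "K": {"AAG": 0.6, "AAA": 0.4},
    "E": {"GAG": 0.6, "GAA": 0.4},
}
CODON_TO_AA = {c: aa for aa, codons in TOY_USAGE.items() for c in codons}

def toy_translate(seq):
    """Translate using only the toy table (length must be a multiple of 3)."""
    return "".join(CODON_TO_AA[seq[i:i + 3]] for i in range(0, len(seq), 3))

def toy_deoptimize(seq, usage=TOY_USAGE):
    """Swap each codon for the least frequent synonym of its amino acid."""
    rarest = {aa: min(codons, key=codons.get) for aa, codons in usage.items()}
    return "".join(rarest[aa] for aa in toy_translate(seq))

original = "AAGGAGAAA"
deoptimized = toy_deoptimize(original)
# Synonymous changes must leave the encoded protein unchanged.
print(deoptimized, toy_translate(original) == toy_translate(deoptimized))
```

Comparing the translations before and after is a cheap sanity check worth running on any recoded sequence.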

In [None]:
# Deoptimize the sequence
deoptimized_rna = deoptimize_sequence(rna)

# Calculate CAI for deoptimized sequence
deoptimized_cai = compute_CAI(deoptimized_rna)

print(f"Original CAI:     {original_cai:.4f}")
print(f"Deoptimized CAI:  {deoptimized_cai:.4f}")
print(f"CAI reduction: {(original_cai - deoptimized_cai) / original_cai * 100:.1f}%")

## 6. Compare All Three Sequences

Let's visualize the CAI profile along the gene using sliding window analysis.
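
For intuition, a sliding-window profile can be sketched as a moving geometric mean over per-codon weights. The helper `toy_sliding_geomean` and the example weights are hypothetical. The sketch reports window start indices, which may differ from the positions returned by `sliding_window_cai`, and it assumes the sequence has at least `window_size` codons.

```python
import numpy as np

def toy_sliding_geomean(weights, window_size, step=1):
    """Geometric mean of per-codon weights in each full window."""
    log_w = np.log(np.asarray(weights, dtype=float))
    # Moving average of log-weights over every full window ("valid" mode).
    means = np.convolve(log_w, np.ones(window_size) / window_size, mode="valid")
    starts = np.arange(0, len(means), step)
    return starts, np.exp(means[starts])

# Hypothetical per-codon weights for a 6-codon sequence.
toy_positions, toy_profile = toy_sliding_geomean(
    [1.0, 0.5, 0.5, 1.0, 1.0, 0.25], window_size=2
)
print(toy_positions, toy_profile)
```

A larger window smooths the profile and hides short clusters of rare codons, while a smaller one is noisier but localizes them better.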

In [None]:
# Calculate sliding window CAI for all three sequences
window_size = 30  # codons
step = 1  # codons

pos_orig, cai_orig = sliding_window_cai(rna, window_size=window_size, step=step)
pos_opt, cai_opt = sliding_window_cai(optimized_rna, window_size=window_size, step=step)
pos_deopt, cai_deopt = sliding_window_cai(deoptimized_rna, window_size=window_size, step=step)

# Plot comparison
fig, ax = plt.subplots(figsize=(14, 5))

ax.plot(pos_orig, cai_orig, 'b-', linewidth=2, alpha=0.8, label=f'Original (CAI={original_cai:.3f})')
ax.plot(pos_opt, cai_opt, 'g-', linewidth=2, alpha=0.8, label=f'Optimized (CAI={optimized_cai:.3f})')
ax.plot(pos_deopt, cai_deopt, 'r-', linewidth=2, alpha=0.8, label=f'Deoptimized (CAI={deoptimized_cai:.3f})')

ax.set_xlabel('Position (codons)', fontsize=14)
ax.set_ylabel('CAI', fontsize=14)
ax.set_title(f'Codon Adaptation Index Along the Gene (Window: {window_size} codons)', fontsize=16)
ax.legend(loc='lower right', fontsize=12)
ax.set_ylim(0, 1.05)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n{'='*50}")
print("SUMMARY")
print(f"{'='*50}")
print(f"Original CAI:     {original_cai:.4f}")
print(f"Optimized CAI:    {optimized_cai:.4f} ({(optimized_cai/original_cai - 1)*100:+.1f}%)")
print(f"Deoptimized CAI:  {deoptimized_cai:.4f} ({(deoptimized_cai/original_cai - 1)*100:+.1f}%)")

## 7. Codon Usage Analysis

Compare the codon usage frequencies between the original and modified sequences.
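
Counting codons is simple to sketch with the standard library. The helper `toy_codon_counts` below is hypothetical, and the structure returned by `calculate_codon_usage` may differ.

```python
from collections import Counter

def toy_codon_counts(seq):
    """Count in-frame codons, treating DNA (T) and RNA (U) alike."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))

toy_counts = toy_codon_counts("ATGAAATTTGGGCCCAAAGGG")
toy_total = sum(toy_counts.values())
for codon, n in toy_counts.most_common():
    print(f"{codon}: {n} ({n / toy_total:.2f})")
```

Dividing each count by the total gives relative frequencies, which are easier to compare across sequences of different lengths.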

In [None]:
# Calculate codon usage for all three sequences
codon_counts_orig = calculate_codon_usage(rna)
codon_counts_opt = calculate_codon_usage(optimized_rna)
codon_counts_deopt = calculate_codon_usage(deoptimized_rna)

print("Original sequence codon usage:")
plot_codon_usage_grouped(rna)

In [None]:
print("Optimized sequence codon usage:")
plot_codon_usage_grouped(optimized_rna)

In [None]:
print("Deoptimized sequence codon usage:")
plot_codon_usage_grouped(deoptimized_rna)

## 8. Work with Custom Sequences

You can also optimize/deoptimize any custom DNA or RNA sequence:

In [None]:
# Example: Optimize a short custom sequence
custom_sequence = "ATGAAATTTGGGCCCAAAGGG"  # 7 codons

print(f"Custom sequence: {custom_sequence}")
print(f"Original CAI: {compute_CAI(custom_sequence):.4f}")

# Optimize
opt_custom = optimize_sequence(custom_sequence)
print(f"\nOptimized:     {opt_custom}")
print(f"Optimized CAI: {compute_CAI(opt_custom):.4f}")

# Deoptimize
deopt_custom = deoptimize_sequence(custom_sequence)
print(f"\nDeoptimized:     {deopt_custom}")
print(f"Deoptimized CAI: {compute_CAI(deopt_custom):.4f}")

## 9. Export Sequences

Save the optimized/deoptimized sequences to files:

In [None]:
# Create output directory
output_dir = Path('codon_optimization_results')
output_dir.mkdir(exist_ok=True)

# Save sequences in FASTA format
def save_fasta(filename, header, sequence):
    """Write one sequence to `filename` as FASTA, wrapped at 60 characters per line."""
    with open(filename, 'w') as f:
        f.write(f">{header}\n")
        # Write sequence in lines of 60 characters
        for i in range(0, len(sequence), 60):
            f.write(sequence[i:i+60] + "\n")

gene_name = dna_file.stem

save_fasta(output_dir / f"{gene_name}_original.fasta", 
           f"{gene_name} original CAI={original_cai:.4f}", rna)
save_fasta(output_dir / f"{gene_name}_optimized.fasta", 
           f"{gene_name} optimized CAI={optimized_cai:.4f}", optimized_rna)
save_fasta(output_dir / f"{gene_name}_deoptimized.fasta", 
           f"{gene_name} deoptimized CAI={deoptimized_cai:.4f}", deoptimized_rna)

print(f"Sequences saved to: {output_dir.absolute()}")
print(f"  - {gene_name}_original.fasta")
print(f"  - {gene_name}_optimized.fasta")
print(f"  - {gene_name}_deoptimized.fasta")

---

## Summary

This notebook demonstrated:

1. **Loading gene sequences** from SnapGene files using `read_sequence()`
2. **Computing CAI** with `compute_CAI()` to measure codon adaptation
3. **Optimizing sequences** with `optimize_sequence()` for maximum translation efficiency
4. **Deoptimizing sequences** with `deoptimize_sequence()` for slow translation studies
5. **Visualizing CAI profiles** with `sliding_window_cai()` and custom plots
6. **Analyzing codon usage** with `calculate_codon_usage()` and `plot_codon_usage_grouped()`

### Key Functions Reference

| Function | Description |
|----------|-------------|
| `compute_CAI(sequence)` | Calculate Codon Adaptation Index |
| `optimize_sequence(sequence)` | Replace codons with most-used synonyms |
| `deoptimize_sequence(sequence)` | Replace codons with least-used synonyms |
| `sliding_window_cai(seq, window_size, step)` | CAI along the gene |
| `plot_sliding_window_cai(...)` | Visualize sliding window CAI |
| `calculate_codon_usage(sequence)` | Count codon frequencies |
| `plot_codon_usage_grouped(sequence)` | Visualize codon usage by amino acid |