# Annotate Genomes with RAST

**Parent**: CDMSCI-193 - RBTnSeq Modeling Analysis

**Ticket**: CDMSCI-198 - Build Genome-Scale Metabolic Models

## Objective

Annotate protein sequences for 57 organisms using RAST (Rapid Annotation using Subsystem Technology) to prepare for metabolic model building.

## What is RAST?

RAST is an online service that annotates bacterial and archaeal genomes with functional roles. ModelSEED templates then use those roles to map genes to metabolic reactions.

**Key Features**:
- Assigns functional roles to genes/proteins
- Maps roles to biochemical reactions
- Required for ModelSEEDpy model building
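- Estimated at 2-4 hours per genome, although runtime varies (the run logged in this notebook annotated ANA3 in about 4 seconds)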
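To make the role-to-reaction idea in the list above concrete, here is a toy sketch. The `toy_template` table and its reaction IDs are invented for illustration and are not taken from a real ModelSEED template. The function `reactions_for_roles` is likewise hypothetical.

```python
# Toy illustration only: this table is made up and does not come from
# a real ModelSEED template.
toy_template = {
    "Flavoprotein MioC": ["rxn_hypothetical_1"],
    "amino acid/peptide transporter": ["rxn_hypothetical_2", "rxn_hypothetical_3"],
}


def reactions_for_roles(roles, template):
    """Return the sorted set of reaction IDs linked to any role in `roles`."""
    found = set()
    for role in roles:
        # Roles missing from the template contribute no reactions
        found.update(template.get(role, []))
    return sorted(found)


gene_roles = ["Flavoprotein MioC", "Unknown role"]
mapped = reactions_for_roles(gene_roles, toy_template)
print(mapped)
```

A gene whose roles are all absent from the template simply maps to no reactions, which is why unannotated features do not contribute to the model.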

## Inputs

- Protein FASTA files (57 organisms)
- Location: `../data/raw/protein_sequences/`

## Outputs

- Annotated genome objects saved as pickle files
- Location: `results/genomes/`
- Format: `{organism_id}_genome.pkl`

**Last updated**: 2025-10-07

## Setup

In [1]:
import modelseedpy
from modelseedpy import MSGenome, RastClient
from pathlib import Path
import pickle
import time
from datetime import datetime

print("Imports successful")
# Fall back to 'dev' only when the package does not expose __version__
print(f"ModelSEEDpy version: {getattr(modelseedpy, '__version__', 'dev')}")

modelseedpy 0.4.2
Imports successful
ModelSEEDpy version: dev


## Configuration

In [2]:
# Paths
FASTA_DIR = Path('../data/raw/protein_sequences')
OUTPUT_DIR = Path('results/genomes')
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# FASTA output directory
FASTA_OUTPUT_DIR = Path('results/fasta_annotated')
FASTA_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

LOG_FILE = Path('results/rast_annotation_log.txt')

print(f"Configuration set")
print(f"  Input directory: {FASTA_DIR}")
print(f"  Output directory: {OUTPUT_DIR}")
print(f"  FASTA output directory: {FASTA_OUTPUT_DIR}")
print(f"  Log file: {LOG_FILE}")

Configuration set
  Input directory: ../data/raw/protein_sequences
  Output directory: results/genomes
  FASTA output directory: results/fasta_annotated
  Log file: results/rast_annotation_log.txt


## Find FASTA Files

In [3]:
print("Scanning for FASTA files...")
fasta_files = sorted(FASTA_DIR.glob('*_proteins.fasta'))

print(f"\nFound {len(fasta_files)} FASTA files:")
for i, fasta_file in enumerate(fasta_files, 1):
    print(f"  {i:2d}. {fasta_file.name}")

Scanning for FASTA files...

Found 57 FASTA files:
   1. ANA3_proteins.fasta
   2. BFirm_proteins.fasta
   3. Bifido_proteins.fasta
   4. Brev2_proteins.fasta
   5. Btheta_proteins.fasta
   6. Burk376_proteins.fasta
   7. Burkholderia_OAS925_proteins.fasta
   8. Bvulgatus_CL09T03C04_proteins.fasta
   9. CL21_proteins.fasta
  10. Caulo_proteins.fasta
  11. Cola_proteins.fasta
  12. Cup4G11_proteins.fasta
  13. Dda3937_proteins.fasta
  14. Ddia6719_proteins.fasta
  15. DdiaME23_proteins.fasta
  16. Dino_proteins.fasta
  17. DvH_proteins.fasta
  18. Dyella79_proteins.fasta
  19. HerbieS_proteins.fasta
  20. Kang_proteins.fasta
  21. Keio_proteins.fasta
  22. Korea_proteins.fasta
  23. Koxy_proteins.fasta
  24. Lysobacter_OAE881_proteins.fasta
  25. MR1_proteins.fasta
  26. Magneto_proteins.fasta
  27. Marino_proteins.fasta
  28. Methanococcus_JJ_proteins.fasta
  29. Methanococcus_S2_proteins.fasta
  30. Miya_proteins.fasta
  31. PS_proteins.fasta
  32. PV4_proteins.fasta
  33. Pedo557_pro

## Initialize RAST Client

In [4]:
print("Initializing RAST client...")
rast = RastClient()

print("\n⚠️  IMPORTANT NOTES:")
print("  - RAST annotation requires internet connection")
print("  - Each genome takes 2-4 hours to annotate")
print("  - 57 genomes = ~114-228 hours total if run serially")
print("  - You can stop and resume this notebook")
print("  - Already annotated genomes will be skipped")
print("\nRAST client ready")

Initializing RAST client...

⚠️  IMPORTANT NOTES:
  - RAST annotation requires internet connection
  - Each genome takes 2-4 hours to annotate
  - 57 genomes = ~114-228 hours total if run serially
  - You can stop and resume this notebook
  - Already annotated genomes will be skipped

RAST client ready
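The serial estimate in the notes is plain arithmetic: 57 genomes times 2 to 4 hours each. The sketch below reproduces it and shows how the figure would change if genomes were split evenly across several workers. The `workers` parameter is an assumption for illustration; the notebook runs serially, and whether the RAST service accepts concurrent submissions is not established here.

```python
import math

N_GENOMES = 57
HOURS_PER_GENOME = (2, 4)  # low and high estimates quoted in the notes


def total_hours(n_genomes, hours_per_genome, workers=1):
    """Wall-clock estimate when genomes are split evenly over `workers`."""
    rounds = math.ceil(n_genomes / workers)  # genomes handled per worker
    low, high = hours_per_genome
    return rounds * low, rounds * high


serial = total_hours(N_GENOMES, HOURS_PER_GENOME)
four_way = total_hours(N_GENOMES, HOURS_PER_GENOME, workers=4)
print(serial)    # (114, 228)
print(four_way)  # (30, 60)
```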


## Helper Functions

In [5]:
def log_message(message):
    """Log message to file and print to console"""
    timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    log_msg = f"[{timestamp}] {message}"
    print(log_msg)
    
    with open(LOG_FILE, 'a') as f:
        f.write(log_msg + '\n')


def get_organism_id(fasta_filename):
    """Extract organism ID from FASTA filename"""
    # Remove _proteins.fasta suffix
    return fasta_filename.replace('_proteins.fasta', '')


def genome_already_annotated(organism_id):
    """Check if genome has already been annotated"""
    output_file = OUTPUT_DIR / f"{organism_id}_genome.pkl"
    return output_file.exists()


def save_genome(genome, organism_id):
    """Save annotated genome as pickle file"""
    output_file = OUTPUT_DIR / f"{organism_id}_genome.pkl"
    with open(output_file, 'wb') as f:
        pickle.dump(genome, f)
    return output_file


def load_genome(organism_id):
    """Load previously annotated genome"""
    input_file = OUTPUT_DIR / f"{organism_id}_genome.pkl"
    with open(input_file, 'rb') as f:
        return pickle.load(f)




def save_genome_as_fasta(genome, organism_id):
    """Save annotated genome as FASTA file with functional annotations"""
    output_file = FASTA_OUTPUT_DIR / f"{organism_id}_RAST.fasta"

    with open(output_file, 'w') as f:
        for feature in genome.features:
            # Create FASTA header with annotations
            header = f">{feature.id}"

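            # Add functional roles if annotated. ontology_terms is a dict of
            # ontology name -> list of roles (see the inspection output), so
            # flatten the values; list(dict) would yield only the keys.
            terms = getattr(feature, 'ontology_terms', None) or {}
            roles_list = [role for roles in terms.values() for role in roles]
            if roles_list:
                # Take the first 3 roles
                header += " | " + "; ".join(str(r) for r in roles_list[:3])
            else:
                header += " | Hypothetical protein"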

            # Write header
            f.write(header + '\n')
            
            # Write sequence (convert to string if needed)
            seq_str = str(feature.seq) if hasattr(feature, 'seq') else ""
            f.write(seq_str + '\n')

    return output_file

def fasta_already_exists(organism_id):
    """Check if FASTA output already exists"""
    output_file = FASTA_OUTPUT_DIR / f"{organism_id}_RAST.fasta"
    return output_file.exists()


print("Helper functions defined")

Helper functions defined
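The resume logic treats any existing `{organism_id}_genome.pkl` as finished, so a run interrupted in the middle of a write could leave a truncated pickle that later runs skip. One possible guard is to write to a temporary file and rename it into place. This is a minimal sketch, not part of the notebook; `save_pickle_atomically` is a hypothetical helper and the demo object stands in for a genome.

```python
import os
import pickle
import tempfile
from pathlib import Path


def save_pickle_atomically(obj, output_file):
    """Pickle `obj` to a temp file beside the target, then rename it into place."""
    output_file = Path(output_file)
    output_file.parent.mkdir(parents=True, exist_ok=True)
    # Same directory as the target, so the rename stays on one filesystem
    fd, tmp_name = tempfile.mkstemp(dir=output_file.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
        os.replace(tmp_name, output_file)
    except BaseException:
        # Remove the partial temp file, then re-raise the original error
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return output_file


# Demo with a stand-in object in a scratch directory
demo_dir = Path(tempfile.mkdtemp())
demo_path = save_pickle_atomically({"id": "demo"}, demo_dir / "demo_genome.pkl")
with open(demo_path, "rb") as f:
    restored = pickle.load(f)
print(restored)
```

With this pattern the final filename only ever appears once the pickle is fully written, so an existence check remains a reliable "done" marker.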


## Annotate Genomes

This cell annotates all genomes sequentially. You can:
- Run all at once (will take many hours)
- Run one at a time by setting `BATCH_SIZE = 1`
- Stop and resume anytime (already annotated genomes are skipped)
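The batch options above amount to slicing the sorted file list. A minimal sketch of that selection follows; `select_batch` is a hypothetical helper name, and the short list of names stands in for `fasta_files`. Unlike a plain truthiness check, this version treats `batch_size=0` as "process nothing" rather than "process all".

```python
def select_batch(items, start_index=0, batch_size=None):
    """Return the slice of `items` to process in this run."""
    if batch_size is None:
        # No limit: everything from start_index onward
        return items[start_index:]
    return items[start_index:start_index + batch_size]


names = ["ANA3", "BFirm", "Bifido", "Brev2", "Btheta"]
print(select_batch(names))                                # all five
print(select_batch(names, start_index=2))                 # skip the first two
print(select_batch(names, start_index=1, batch_size=2))   # two items from index 1
```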

In [6]:
# Configuration
BATCH_SIZE = None  # Set to None to process all, or set to N to process N genomes
START_INDEX = 0    # Set to skip first N genomes (useful for resuming)

log_message("="*80)
log_message("STARTING RAST ANNOTATION")
log_message("="*80)

# Select genomes to process
if BATCH_SIZE:
    genomes_to_process = fasta_files[START_INDEX:START_INDEX+BATCH_SIZE]
else:
    genomes_to_process = fasta_files[START_INDEX:]

log_message(f"Processing {len(genomes_to_process)} genomes (starting from index {START_INDEX})")

# Track progress
total_genomes = len(genomes_to_process)
completed = 0
skipped = 0
failed = 0

for i, fasta_file in enumerate(genomes_to_process, 1):
    organism_id = get_organism_id(fasta_file.name)
    
    log_message(f"\n[{i}/{total_genomes}] Processing: {organism_id}")
    log_message(f"  FASTA file: {fasta_file.name}")
    
    # Check if already annotated (both pickle and FASTA exist)
    if genome_already_annotated(organism_id) and fasta_already_exists(organism_id):
        log_message(f"  ✓ Already annotated - skipping")
        skipped += 1
        continue
    
    try:
        # Load genome from FASTA
        log_message(f"  Loading genome from FASTA...")
        genome = MSGenome.from_fasta(str(fasta_file))
        log_message(f"  Loaded {len(genome.features)} features")
        
        # Annotate with RAST
        log_message(f"  Submitting to RAST for annotation...")
        log_message(f"  ⏳ This will take 2-4 hours - please be patient")
        
        start_time = time.time()
        result = rast.annotate_genome(genome)
        elapsed_time = time.time() - start_time
        
        log_message(f"  ✓ Annotation complete ({elapsed_time/60:.1f} minutes)")
        
        # Count annotations
        annotated_features = sum(1 for f in genome.features if f.ontology_terms)
        log_message(f"  Annotated features: {annotated_features}/{len(genome.features)}")
        
        # Save genome as pickle
        output_file = save_genome(genome, organism_id)
        log_message(f"  ✓ Saved pickle: {output_file.name}")

        # Save genome as annotated FASTA (separate name so the loop variable is not shadowed)
        annotated_fasta = save_genome_as_fasta(genome, organism_id)
        log_message(f"  ✓ Saved FASTA: {annotated_fasta.name}")
        
        completed += 1
        
    except Exception as e:
        log_message(f"  ✗ ERROR: {str(e)}")
        failed += 1
        continue

log_message(f"\n{'='*80}")
log_message("ANNOTATION SUMMARY")
log_message(f"{'='*80}")
log_message(f"Total processed: {total_genomes}")
log_message(f"  Completed: {completed}")
log_message(f"  Skipped (already done): {skipped}")
log_message(f"  Failed: {failed}")
log_message(f"\nOutput directory: {OUTPUT_DIR}")
log_message(f"{'='*80}")

[2025-10-14 20:01:36] STARTING RAST ANNOTATION
[2025-10-14 20:01:36] Processing 57 genomes (starting from index 0)
[2025-10-14 20:01:36] 
[1/57] Processing: ANA3
[2025-10-14 20:01:36]   FASTA file: ANA3_proteins.fasta
[2025-10-14 20:01:36]   Loading genome from FASTA...
[2025-10-14 20:01:36]   Loaded 4360 features
[2025-10-14 20:01:36]   Submitting to RAST for annotation...
[2025-10-14 20:01:36]   ⏳ This will take 2-4 hours - please be patient
[2025-10-14 20:01:40]   ✓ Annotation complete (0.1 minutes)
[2025-10-14 20:01:40]   Annotated features: 4071/4360
[2025-10-14 20:01:40]   ✓ Saved pickle: ANA3_genome.pkl
[2025-10-14 20:01:40]   ✓ Saved FASTA: ANA3_RAST.fasta
[2025-10-14 20:01:40] 
[2/57] Processing: BFirm
[2025-10-14 20:01:40]   FASTA file: BFirm_proteins.fasta
[2025-10-14 20:01:40]   Loading genome from FASTA...
[2025-10-14 20:01:40]   Loaded 7182 features
[2025-10-14 20:01:40]   Submitting to RAST for annotation...
[2025-10-14 20:01:40]   ⏳ This will take 2-4 hours - please be 

## Check Annotation Status

In [7]:
print("Checking annotation status for all organisms...")
print("="*80)

annotated_genomes = []
pending_genomes = []

for fasta_file in fasta_files:
    organism_id = get_organism_id(fasta_file.name)
    
    if genome_already_annotated(organism_id):
        annotated_genomes.append(organism_id)
    else:
        pending_genomes.append(organism_id)

print(f"\nAnnotated: {len(annotated_genomes)}/{len(fasta_files)} ({100*len(annotated_genomes)/len(fasta_files):.1f}%)")
print(f"Pending: {len(pending_genomes)} ({100*len(pending_genomes)/len(fasta_files):.1f}%)")

if pending_genomes:
    print(f"\n⚠️  Still need to annotate:")
    for org_id in pending_genomes[:10]:
        print(f"  - {org_id}")
    if len(pending_genomes) > 10:
        print(f"  ... and {len(pending_genomes) - 10} more")
else:
    print(f"\n✓ All genomes annotated!")

print("="*80)

Checking annotation status for all organisms...

Annotated: 57/57 (100.0%)
Pending: 0 (0.0%)

✓ All genomes annotated!
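The status check above looks only at the pickle, while the annotation loop skips a genome only when both the pickle and the annotated FASTA exist. A stricter report could separate the two cases. The sketch below is an assumption-laden illustration: `annotation_status` is a hypothetical helper, and the demo uses scratch directories with empty placeholder files rather than real outputs.

```python
import tempfile
from pathlib import Path


def annotation_status(organism_ids, genome_dir, fasta_dir):
    """Classify each organism as 'complete', 'partial', or 'pending'."""
    status = {}
    for org in organism_ids:
        have_pkl = (Path(genome_dir) / f"{org}_genome.pkl").exists()
        have_fasta = (Path(fasta_dir) / f"{org}_RAST.fasta").exists()
        if have_pkl and have_fasta:
            status[org] = "complete"
        elif have_pkl or have_fasta:
            status[org] = "partial"   # one of the two outputs is missing
        else:
            status[org] = "pending"
    return status


# Demo with placeholder files in scratch directories
genome_dir = Path(tempfile.mkdtemp())
fasta_dir = Path(tempfile.mkdtemp())
(genome_dir / "ANA3_genome.pkl").touch()
(fasta_dir / "ANA3_RAST.fasta").touch()
(genome_dir / "BFirm_genome.pkl").touch()  # pickle only, FASTA export missing

status = annotation_status(["ANA3", "BFirm", "Bifido"], genome_dir, fasta_dir)
print(status)
```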


## Load and Inspect Annotated Genome (Example)

In [8]:
if annotated_genomes:
    # Load first annotated genome as example
    example_org = annotated_genomes[0]
    print(f"Loading example genome: {example_org}")
    
    genome = load_genome(example_org)
    
    print(f"\nGenome information:")
    print(f"  Total features: {len(genome.features)}")
    
    # Count annotated features
    annotated = [f for f in genome.features if f.ontology_terms]
    print(f"  Annotated features: {len(annotated)}")
    print(f"  Annotation rate: {100*len(annotated)/len(genome.features):.1f}%")
    
    # Show first 5 annotated features
    print(f"\nFirst 5 annotated features:")
    for i, feature in enumerate(annotated[:5], 1):
        print(f"  {i}. {feature.id}")
        # ontology_terms is a dict (ontology name -> list of roles), so slice each list rather than the dict
        roles = {name: terms[:2] for name, terms in feature.ontology_terms.items()}
        print(f"     Roles: {roles}")
else:
    print("No annotated genomes found. Please run annotation first.")

Loading example genome: ANA3

Genome information:
  Total features: 4360
  Annotated features: 4071
  Annotation rate: 93.4%

First 5 annotated features:
  1. ANA3:7022746
     Roles: {'RAST': ['16S rRNA (guanine(527)-N(7))-methyltransferase (EC 2.1.1.170)']}
  2. ANA3:7022747
     Roles: {'RAST': ['tRNA-5-carboxymethylaminomethyl-2-thiouridine(34) synthesis protein MnmG']}
  3. ANA3:7022748
     Roles: {'RAST': ['Flavoprotein MioC']}
  4. ANA3:7022749
     Roles: {'RAST': ['amino acid/peptide transporter']}
  5. ANA3:7022750
     Roles: {'RAST': ['tRNA-5-carboxymethylaminomethyl-2-thiouridine(34) synthesis protein MnmE']}
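The output above shows that `ontology_terms` is a dictionary keyed by ontology name (here `'RAST'`) with a list of role strings as each value. When a flat list of roles is more convenient, the values can be flattened as in this sketch. `flatten_roles` is a hypothetical helper, and the example dictionaries reuse role strings from the output above.

```python
def flatten_roles(ontology_terms, limit=None):
    """Collect role strings from every ontology into one list."""
    roles = [role for terms in ontology_terms.values() for role in terms]
    return roles if limit is None else roles[:limit]


single = {"RAST": ["Flavoprotein MioC"]}
several = {"RAST": ["Flavoprotein MioC", "amino acid/peptide transporter"]}

print(flatten_roles(single))
print(flatten_roles(several, limit=1))
print(flatten_roles({}))
```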


## Summary

- `results/genomes/{organism_id}_genome.pkl` - Annotated genome objects (57 files)
- `results/fasta_annotated/{organism_id}_RAST.fasta` - Annotated FASTA files (57 files)
- `results/rast_annotation_log.txt` - Annotation log with timestamps

**Next Steps**:
1. Wait for all RAST annotations to complete
2. Proceed to next notebook: `02-build-metabolic-models.ipynb`
3. Build genome-scale metabolic models using annotated genomes + templates

**Important Notes**:
- You can stop and resume this notebook anytime
- Already annotated genomes will be skipped automatically
- Check `rast_annotation_log.txt` for detailed progress
- RAST requires internet connection
- Total time: ~114-228 hours for 57 genomes (if run serially)

**Troubleshooting**:
- If RAST fails, check internet connection
- If annotation hangs, interrupt and rerun the cell; genomes whose pickle and annotated FASTA are already saved are skipped
- For issues, check RAST service status at http://rast.nmpdr.org
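For transient network failures, one option is to retry a failed call a few times with increasing delays before counting the genome as failed. The sketch below is a generic pattern, not something the notebook does; `call_with_retries` and its parameters are hypothetical, and a stub function stands in for the RAST call so the example runs offline.

```python
import time


def call_with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `func()`; on failure wait base_delay * 2**n seconds and retry."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


# Offline demo: a stub that fails twice, then succeeds
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


delays = []  # record requested delays instead of actually sleeping
result = call_with_retries(flaky, sleep=delays.append)
print(result, delays)
```

In the annotation loop this could wrap the existing call, for example `call_with_retries(lambda: rast.annotate_genome(genome))`, with delays chosen to suit the service.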