# Download Protein Sequences for Metabolic Model Building

**Purpose**: Download protein sequences (FASTA) for all 57 organisms in the carbon source growth matrix

**Data Source**: Fitness Browser orgSeqs.cgi endpoint

**Organism List**: Loaded from CDMSCI-196 carbon source growth matrix

**Output**:
- `../data/raw/protein_sequences/` - One FASTA file per organism (57 total)
- `../data/raw/logs/` - Download logs and summaries

**Last updated**: 2025-10-02

## Install Dependencies

First, install cloudscraper (used to bypass Cloudflare protection), along with pandas and tqdm.

In [27]:
!pip install -q cloudscraper pandas tqdm
print("Dependencies installed")

Dependencies installed


## Setup

Import the libraries used throughout the notebook.

In [28]:
import cloudscraper
import pandas as pd
from pathlib import Path
import time
from datetime import datetime
from tqdm.auto import tqdm
import warnings
warnings.filterwarnings('ignore')

print("Imports successful")

Imports successful


## Configuration

In [None]:
# Base URLs
BASE_URL = "https://fit.genomics.lbl.gov/cgi-bin"
PROTEIN_URL = f"{BASE_URL}/orgSeqs.cgi"

# Output directories (use parent project's data directory)
DATA_DIR = Path("../data/raw")
PROTEIN_DIR = DATA_DIR / "protein_sequences"
LOG_DIR = DATA_DIR / "logs"

# Create directories
PROTEIN_DIR.mkdir(parents=True, exist_ok=True)
LOG_DIR.mkdir(parents=True, exist_ok=True)

# Log file
LOG_FILE = LOG_DIR / "download_log.txt"

# Request settings
TIMEOUT = 60  # seconds
RETRY_ATTEMPTS = 3
DELAY_BETWEEN_REQUESTS = 1  # seconds (be polite to server)

print("Configuration set")
print(f"  Protein sequences → {PROTEIN_DIR}")
print(f"  Logs → {LOG_FILE}")

## Load Organism List from Carbon Source Growth Matrix

Load the list of all 57 organisms from the index (first column) of the CDMSCI-196 carbon source growth matrix.

In [30]:
# Load carbon source growth matrix from CDMSCI-196
MATRIX_FILE = Path("../CDMSCI-196-carbon-sources/results/carbon_source_growth_matrix.csv")

if not MATRIX_FILE.exists():
    raise FileNotFoundError(f"Carbon source growth matrix not found at {MATRIX_FILE}. Please run CDMSCI-196 first.")

# Read matrix
growth_matrix = pd.read_csv(MATRIX_FILE, index_col=0)

# Get organism list
ORGANISMS = growth_matrix.index.tolist()

print(f"Loaded {len(ORGANISMS)} organisms from carbon source growth matrix")
print(f"\nAll organisms:")
print("="*70)
for i, org in enumerate(ORGANISMS, 1):
    print(f"  {i:2d}. {org}")

Loaded 57 organisms from carbon source growth matrix

All organisms:
   1. ANA3
   2. BFirm
   3. Bifido
   4. Brev2
   5. Btheta
   6. Burk376
   7. Burkholderia_OAS925
   8. Bvulgatus_CL09T03C04
   9. CL21
  10. Caulo
  11. Cola
  12. Cup4G11
  13. Dda3937
  14. Ddia6719
  15. DdiaME23
  16. Dino
  17. DvH
  18. Dyella79
  19. HerbieS
  20. Kang
  21. Keio
  22. Korea
  23. Koxy
  24. Lysobacter_OAE881
  25. MR1
  26. Magneto
  27. Marino
  28. Methanococcus_JJ
  29. Methanococcus_S2
  30. Miya
  31. PS
  32. PV4
  33. Pedo557
  34. Phaeo
  35. Ponti
  36. Putida
  37. RalstoniaBSBF1503
  38. RalstoniaGMI1000
  39. RalstoniaPSI07
  40. RalstoniaUW163
  41. SB2B
  42. Smeli
  43. SynE
  44. SyringaeB728a
  45. SyringaeB728a_mexBdelta
  46. Variovorax_OAS795
  47. WCS417
  48. Xantho
  49. acidovorax_3H11
  50. azobra
  51. psRCH2
  52. pseudo13_GW456_L13
  53. pseudo1_N1B4
  54. pseudo3_N2E3
  55. pseudo5_N2C3_1
  56. pseudo6_N2E2
  57. rhodanobacter_10B01


## Helper Functions

In [31]:
# Create cloudscraper session (bypasses Cloudflare protection)
scraper = cloudscraper.create_scraper(
    browser={
        'browser': 'chrome',
        'platform': 'darwin',
        'desktop': True
    }
)

def log_message(message, also_print=True):
    """Write message to log file with timestamp"""
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    log_entry = f"[{timestamp}] {message}\n"
    
    with open(LOG_FILE, 'a') as f:
        f.write(log_entry)
    
    if also_print:
        print(message)


def download_file(url, output_path, description, max_retries=RETRY_ATTEMPTS):
    """
    Download file from URL with retry logic
    Uses cloudscraper to bypass Cloudflare protection
    
    Returns:
        tuple: (success: bool, file_size: int, error_message: str)
    """
    for attempt in range(1, max_retries + 1):
        try:
            response = scraper.get(url, timeout=TIMEOUT)
            response.raise_for_status()
            
            # Check if response is not an error page
            if len(response.content) < 100 and b'error' in response.content.lower():
                return False, 0, "Server returned error page"
            
            # Check if we got a challenge page (shouldn't happen with cloudscraper)
            if b'challenge' in response.content.lower() and len(response.content) < 10000:
                return False, 0, "Cloudflare challenge not bypassed"
            
            # Save file
            with open(output_path, 'wb') as f:
                f.write(response.content)
            
            file_size = len(response.content)
            return True, file_size, None
            
        except Exception as e:
            if attempt < max_retries:
                log_message(f"  Error on attempt {attempt}/{max_retries}: {str(e)}", also_print=False)
                time.sleep(2)
            else:
                return False, 0, str(e)
    
    return False, 0, "Unknown error"


def format_size(size_bytes):
    """Format bytes to human-readable size"""
    for unit in ['B', 'KB', 'MB', 'GB']:
        if size_bytes < 1024.0:
            return f"{size_bytes:.1f} {unit}"
        size_bytes /= 1024.0
    return f"{size_bytes:.1f} TB"


print("Helper functions defined")
print("Cloudscraper session created (bypasses Cloudflare protection)")

Helper functions defined
Cloudscraper session created (bypasses Cloudflare protection)
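The error-page and challenge-page checks in `download_file` are size and keyword heuristics. A stricter complement, sketched here, is to confirm that the payload looks like FASTA before writing it to disk. The helper name `looks_like_fasta`, its `min_records` parameter, and the sample payloads are hypothetical and not part of the notebook.

```python
def looks_like_fasta(payload: bytes, min_records: int = 1) -> bool:
    """Return True if payload starts with a FASTA header and has enough records."""
    text = payload.lstrip()
    if not text.startswith(b">"):
        return False
    # Every FASTA record begins with '>' at the start of a line.
    n_records = sum(1 for line in text.splitlines() if line.startswith(b">"))
    return n_records >= min_records


# Made-up payloads for illustration only
good = b">org:locus1 locus1 some protein\nMSPQ\n>org:locus2 locus2 other\nMTTS\n"
bad = b"<html><body>Just a moment...</body></html>"
print(looks_like_fasta(good))  # True
print(looks_like_fasta(bad))   # False
```

A check like this could run on `response.content` before the file is saved, so that an HTML page is never stored under a `.fasta` name.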


## Download Functions

In [32]:
def download_protein_sequences(org_id, skip_existing=True):
    """
    Download protein sequences for an organism
    
    Args:
        org_id: Organism identifier (e.g., 'Keio', 'acidovorax_3H11')
        skip_existing: If True, skip download if file already exists
    
    Returns:
        tuple: (success: bool, file_path: Path, file_size: int, error: str, skipped: bool)
    """
    output_file = PROTEIN_DIR / f"{org_id}_proteins.fasta"
    
    # Check if file already exists
    if skip_existing and output_file.exists():
        file_size = output_file.stat().st_size
        # Verify it's a valid FASTA file (has at least 1 sequence header)
        try:
            with open(output_file, 'r') as f:
                first_line = f.readline()
                if first_line.startswith('>'):
                    log_message(f"  {org_id}: SKIPPED (already exists, {format_size(file_size)})", also_print=False)
                    return True, output_file, file_size, None, True
        except (OSError, UnicodeDecodeError):
            pass  # File exists but is unreadable or corrupted; fall through and re-download
    
    # Download file
    url = f"{PROTEIN_URL}?orgId={org_id}"
    success, size, error = download_file(url, output_file, "protein sequences")
    
    if success:
        log_message(f"  {org_id}: DOWNLOADED {format_size(size)}", also_print=False)
    else:
        log_message(f"  {org_id} FAILED: {error}", also_print=False)
    
    return success, output_file, size, error, False


print("Download function defined")

Download function defined
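`download_protein_sequences` builds the request URL by appending `?orgId=<org_id>` to the endpoint. The organism IDs in this matrix contain only letters, digits, and underscores, so plain string formatting works. If IDs with other characters were ever added, encoding the query string would be safer. The sketch below shows one way to do that with the standard library; `build_protein_url` is a hypothetical helper, not something the notebook defines.

```python
from urllib.parse import urlencode

# Same endpoint as PROTEIN_URL in the Configuration cell
PROTEIN_URL = "https://fit.genomics.lbl.gov/cgi-bin/orgSeqs.cgi"


def build_protein_url(org_id: str) -> str:
    """Build the download URL, encoding the organism ID as a query parameter."""
    return f"{PROTEIN_URL}?{urlencode({'orgId': org_id})}"


print(build_protein_url("Keio"))
# https://fit.genomics.lbl.gov/cgi-bin/orgSeqs.cgi?orgId=Keio
```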


## Test Download (Single Organism)

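Test with a single organism before running the full download. Because `skip_existing` defaults to `True`, a file that is already on disk and starts with a FASTA header is reported as SKIPPED rather than downloaded again.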

In [33]:
# Test with first organism
test_org = ORGANISMS[0]
print(f"Testing download for: {test_org}")
print("="*50)

success, path, size, error, skipped = download_protein_sequences(test_org)

print("\nTest Results:")
print(f"  Protein sequences: {'SKIPPED' if skipped else 'SUCCESS' if success else 'FAILED'}")
if success:
    print(f"    Size: {format_size(size)}")
    print(f"    Path: {path}")
elif not skipped:
    print(f"    Error: {error}")

if success:
    print("\nSUCCESS: Test successful! Ready to download all organisms.")
else:
    print("\nWARNING: Test had issues. Check errors above before proceeding.")

Testing download for: ANA3

Test Results:
  Protein sequences: SKIPPED
    Size: 1.7 MB
    Path: data/protein_sequences/ANA3_proteins.fasta

SUCCESS: Test successful! Ready to download all organisms.


## Download All Organisms

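WARNING: This downloads protein sequences for all 57 organisms and should take 2-5 minutes, depending on connection speed. Files that already exist and start with a FASTA header are skipped, so re-running the cell is quick.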

In [34]:
# Initialize log
log_message("="*70)
log_message(f"Starting download for {len(ORGANISMS)} organisms")
log_message("="*70)

# Track results
all_results = []
total_size = 0
successful = 0
skipped = 0

# Download with progress bar
for org_id in tqdm(ORGANISMS, desc="Downloading protein sequences"):
    success, path, size, error, was_skipped = download_protein_sequences(org_id)
    
    all_results.append({
        'org_id': org_id,
        'success': success,
        'skipped': was_skipped,
        'size': size,
        'size_readable': format_size(size),
        'path': str(path),
        'error': error
    })
    
    if success:
        successful += 1
        total_size += size
        if was_skipped:
            skipped += 1
    
    # Delay between requests (skip for already-downloaded files)
    if not was_skipped:
        time.sleep(DELAY_BETWEEN_REQUESTS)

# Summary
print("\n" + "="*70)
print("DOWNLOAD COMPLETE")
print("="*70)
print(f"\nProtein Sequences:")
print(f"  Successful: {successful}/{len(ORGANISMS)}")
if skipped > 0:
    print(f"  Skipped (already existed): {skipped}")
    print(f"  Newly downloaded: {successful - skipped}")
print(f"  Total size: {format_size(total_size)}")

log_message("="*70)
log_message(f"Download complete: {successful}/{len(ORGANISMS)} organisms")
if skipped > 0:
    log_message(f"Skipped (already existed): {skipped}")
log_message(f"Total size: {format_size(total_size)}")
log_message("="*70)

Starting download for 57 organisms




DOWNLOAD COMPLETE

Protein Sequences:
  Successful: 57/57
  Skipped (already existed): 57
  Newly downloaded: 0
  Total size: 98.5 MB
Download complete: 57/57 organisms
Skipped (already existed): 57
Total size: 98.5 MB


## Save Download Summary

In [35]:
# Create summary DataFrame
summary_df = pd.DataFrame(all_results)

# Save summary
summary_file = LOG_DIR / f"download_summary_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
summary_df.to_csv(summary_file, index=False)

print(f"Summary saved to: {summary_file}")
print(f"\nFirst few rows:")
display(summary_df.head(10))

Summary saved to: data/logs/download_summary_20251002_024002.csv

First few rows:


| | org_id | success | skipped | size | size_readable | path | error |
|---|---|---|---|---|---|---|---|
| 0 | ANA3 | True | True | 1817199 | 1.7 MB | data/protein_sequences/ANA3_proteins.fasta | |
| 1 | BFirm | True | True | 2745855 | 2.6 MB | data/protein_sequences/BFirm_proteins.fasta | |
| 2 | Bifido | True | True | 807501 | 788.6 KB | data/protein_sequences/Bifido_proteins.fasta | |
| 3 | Brev2 | True | True | 1192489 | 1.1 MB | data/protein_sequences/Brev2_proteins.fasta | |
| 4 | Btheta | True | True | 2246250 | 2.1 MB | data/protein_sequences/Btheta_proteins.fasta | |
| 5 | Burk376 | True | True | 2654112 | 2.5 MB | data/protein_sequences/Burk376_proteins.fasta | |
| 6 | Burkholderia_OAS925 | True | True | 2654281 | 2.5 MB | data/protein_sequences/Burkholderia_OAS925_pro... | |
| 7 | Bvulgatus_CL09T03C04 | True | True | 1772467 | 1.7 MB | data/protein_sequences/Bvulgatus_CL09T03C04_pr... | |
| 8 | CL21 | True | True | 1841805 | 1.8 MB | data/protein_sequences/CL21_proteins.fasta | |
| 9 | Caulo | True | True | 1458861 | 1.4 MB | data/protein_sequences/Caulo_proteins.fasta | |


## Check Downloaded Files

In [36]:
# Count files
protein_files = list(PROTEIN_DIR.glob("*.fasta"))

print(f"Files in {PROTEIN_DIR}:")
print(f"  {len(protein_files)} protein sequence files")

# Show some examples
print(f"\nExample protein sequence files:")
for f in sorted(protein_files)[:10]:
    size = f.stat().st_size
    print(f"  {f.name}: {format_size(size)}")

Files in data/protein_sequences:
  57 protein sequence files

Example protein sequence files:
  ANA3_proteins.fasta: 1.7 MB
  BFirm_proteins.fasta: 2.6 MB
  Bifido_proteins.fasta: 788.6 KB
  Brev2_proteins.fasta: 1.1 MB
  Btheta_proteins.fasta: 2.1 MB
  Burk376_proteins.fasta: 2.5 MB
  Burkholderia_OAS925_proteins.fasta: 2.5 MB
  Bvulgatus_CL09T03C04_proteins.fasta: 1.7 MB
  CL21_proteins.fasta: 1.8 MB
  Caulo_proteins.fasta: 1.4 MB


## Quick Data Inspection

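Preview one downloaded file and count its sequences to confirm it is valid FASTA. The file inspected is whichever one `glob` returns first, so it is not necessarily the first organism alphabetically.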

In [37]:
# Check protein sequences
if protein_files:
    test_fasta = protein_files[0]
    print(f"Inspecting: {test_fasta.name}")
    print("="*70)
    with open(test_fasta, 'r') as f:
        lines = f.readlines()[:20]  # First 20 lines
        print(''.join(lines))
    
    # Count sequences
    with open(test_fasta, 'r') as f:
        n_seqs = sum(1 for line in f if line.startswith('>'))
    print(f"\nTotal sequences in file: {n_seqs}")

Inspecting: Lysobacter_OAE881_proteins.fasta
>Lysobacter_OAE881:ABIE51_RS00030 ABIE51_RS00030 tyrosine--tRNA ligase
MSPQISAPVQDALDLIGRGADEILKREELEARLKLGRPLRVKAGFDPTAPDLHLGHTVLLNKMRQFQDLGHQVIFLIGDFTGMIGDPTGKNVTRKPLTREDVLANAQTYADQVFKVLDKERTEVRFNSEWFGQMGAADMIKLAAQHTVARMLERDDFAKRYAAQQSIAIHEFLYPLVQGYDSVALKADVELGGTDQKFNLLMGRGLQEHYGQAPQIVLTMPLLEGLDGVNKMSKSLGNYIGINEPAIDMVTKTMKIDDVLMWRWIDLLSFEIGIAEAAQLKRDIEGGQLNPRDLKLRLARELAARFHGESAAELAIAGWNAAVRGEGDIASLPINDVAVPAEGVRIAALLTAAGLTPSNSEANRKLKERAVRLDGEVIEDVQRVFVPGFEGVLAVGKRTFARVRLIAA
>Lysobacter_OAE881:ABIE51_RS00035 ABIE51_RS00035 peptidoglycan DD-metalloendopeptidase family protein
MTTSETGTDRRQRLKALREAALHRPVAASQVSRGFNGRWTRRHWAHASLFATLGVLVAAIVPGFGPTSNVPAPVHAQRSSLALALPPLPLSRLKGHSGDSWQIVRVERGQTLGAVFEDMDLPASTMHQILEAVPNDKSVLTRLKPGTELAFDLPVSGELRTFRYDRDDSHRVELAIAGDKVTEKVTVRPTETRTVVISGKVGKSLFRSARKLGLSGNNINTLTDDIFKYDIDFNEDVGADDRFSVVVEQTWREGELLRTGPVLAATFTTGGKLHTGFRFERDGKAEYFTGDGRPLKKSFIRMPIPYARLTSNFGTRKHPVLGRTRMHKGVDYAARTGTPIMAAGDARVVSAGWQGGYGNAVVLDHGRGYSTLY
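Counting header lines confirms that a file contains records, but it says nothing about the sequences themselves. As a rough sketch of a slightly deeper check, the generator below yields `(header, sequence)` pairs so that sequence lengths can be inspected as well. The function name `iter_fasta` and the example records are made up for illustration; only the header layout (`>orgId:locusId locusId description`) follows the output above.

```python
from io import StringIO
from typing import Iterator, TextIO, Tuple


def iter_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up records in the same header layout as the output above
example = StringIO(
    ">OrgX:LOCUS_0001 LOCUS_0001 example protein A\nMSPQ\nISAP\n"
    ">OrgX:LOCUS_0002 LOCUS_0002 example protein B\nMTTSETG\n"
)
records = list(iter_fasta(example))
lengths = [len(seq) for _, seq in records]
print(len(records), lengths)  # 2 [8, 7]
```

For a real file, the same generator can be used inside `with open(test_fasta) as f:`. A record with an empty sequence would then show up as a length of 0.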

## Failed Downloads

Check which organisms (if any) failed to download

In [38]:
# Find failed downloads
failed = summary_df[~summary_df['success']]

if len(failed) > 0:
    print(f"WARNING: {len(failed)} organisms failed to download:")
    display(failed[['org_id', 'error']])
else:
    print("SUCCESS: All protein sequences downloaded successfully")

SUCCESS: All protein sequences downloaded successfully


## Retry Failed Downloads (Optional)

If some downloads failed, you can retry them

In [40]:
# Uncomment to retry failed downloads.
# skip_existing=False forces a fresh download even if a partial file is on disk.

# if len(failed) > 0:
#     print(f"Retrying {len(failed)} failed organisms...")
#     
#     for org_id in tqdm(failed['org_id'], desc="Retrying"):
#         success, path, size, error, skipped = download_protein_sequences(org_id, skip_existing=False)
#         if not skipped:
#             time.sleep(DELAY_BETWEEN_REQUESTS)
#     
#     print("Retry complete")
# else:
#     print("No failed downloads to retry")
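`download_file` waits a fixed 2 seconds between attempts. When downloads fail repeatedly, a delay that grows with each attempt may be gentler on the server. The following is a minimal sketch of exponential backoff, not what the notebook does. `retry_with_backoff`, `base_delay`, and the fake `flaky_download` function are all hypothetical, and the `sleep` argument is injectable only so the demonstration runs without waiting.

```python
import time


def retry_with_backoff(func, max_retries=3, base_delay=2.0, sleep=time.sleep):
    """Call func() until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # Out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))  # 2 s, 4 s, 8 s, ...


# Demonstration with a fake download that fails twice, then succeeds
calls = {"n": 0}
waits = []


def flaky_download():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


outcome = retry_with_backoff(flaky_download, sleep=waits.append)
print(outcome, waits)  # ok [2.0, 4.0]
```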

## Final Summary

In [41]:
# Calculate statistics
print("DOWNLOAD SUMMARY")
print("="*70)
print(f"\nOrganisms processed: {len(ORGANISMS)}")
print(f"\nProtein Sequences:")
print(f"  Downloaded: {summary_df['success'].sum()} / {len(ORGANISMS)}")
print(f"  Success rate: {100 * summary_df['success'].mean():.1f}%")
print(f"  Total size: {format_size(summary_df['size'].sum())}")
print(f"  Average size: {format_size(summary_df['size'].mean())}")

print(f"\nReady for next step: RAST annotation and model building")

DOWNLOAD SUMMARY

Organisms processed: 57

Protein Sequences:
  Downloaded: 57 / 57
  Success rate: 100.0%
  Total size: 98.5 MB
  Average size: 1.7 MB

Ready for next step: RAST annotation and model building


## Next Steps

After running this notebook:

1. Verify all 57 organisms downloaded successfully
2. Check that FASTA files are valid (contain protein sequences)
3. Proceed to genome annotation (RAST or alternative)
4. Build metabolic models using ModelSEED/KBase
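Steps 1 and 2 can be checked programmatically. The sketch below assumes the `{org_id}_proteins.fasta` naming used by this notebook; the helper name `find_invalid_fasta` and the demonstration organism IDs are hypothetical. It only checks that each file exists and begins with a `>` header, which is the same test the skip logic applies.

```python
from pathlib import Path
import tempfile


def find_invalid_fasta(org_ids, protein_dir: Path):
    """Return organism IDs whose FASTA file is missing or lacks a '>' header on line 1."""
    problems = []
    for org_id in org_ids:
        path = protein_dir / f"{org_id}_proteins.fasta"
        if not path.is_file():
            problems.append(org_id)
            continue
        with open(path, "r") as f:
            if not f.readline().startswith(">"):
                problems.append(org_id)
    return problems


# Demonstration in a throwaway directory with made-up organism IDs
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "orgA_proteins.fasta").write_text(">orgA:g1 g1 demo\nMSPQ\n")
    (tmp_dir / "orgB_proteins.fasta").write_text("<html>error</html>\n")
    result = find_invalid_fasta(["orgA", "orgB", "orgC"], tmp_dir)
print(result)  # ['orgB', 'orgC']
```

In the notebook itself, the call would be `find_invalid_fasta(ORGANISMS, PROTEIN_DIR)`, and an empty list would indicate that all 57 files passed.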

**Files created**:
- `../data/raw/protein_sequences/*.fasta` - Protein sequences (one per organism, 57 total)
- `../data/raw/logs/download_log.txt` - Detailed download log
- `../data/raw/logs/download_summary_*.csv` - Summary CSV with results