# Download Fitness Browser Organism Data

**Purpose**: Download protein sequences and experiment metadata for all organisms in the Fitness Browser (RB-TnSeq database from the Deutschbauer lab).

**Downloads**:
1. Protein sequences (FASTA format) from `orgSeqs.cgi`
2. Experiment metadata (tab-delimited) from `createExpData.cgi`

**Output**:
- `data/raw/protein_sequences/` - One FASTA file per organism
- `data/raw/experiment_metadata/` - One TSV file per organism
- `data/raw/logs/download_log.txt` - Download log with timestamps

**Last updated**: 2025-10-01

## Setup

In [1]:
!pip install -q cloudscraper pandas tqdm jupyter
print("✓ Dependencies installed")

✓ Dependencies installed


## Imports

Import cloudscraper (used to get past Cloudflare protection) along with the other libraries installed above.

In [2]:
import cloudscraper
import pandas as pd
from pathlib import Path
import time
from datetime import datetime
from tqdm.auto import tqdm
import warnings
warnings.filterwarnings('ignore')

print("✓ Imports successful")

✓ Imports successful


## Configuration

In [3]:
# Base URLs
BASE_URL = "https://fit.genomics.lbl.gov/cgi-bin"
PROTEIN_URL = f"{BASE_URL}/orgSeqs.cgi"
EXPERIMENT_URL = f"{BASE_URL}/createExpData.cgi"

# Output directories
DATA_DIR = Path("../data/raw")
PROTEIN_DIR = DATA_DIR / "protein_sequences"
EXPERIMENT_DIR = DATA_DIR / "experiment_metadata"
LOG_DIR = DATA_DIR / "logs"

# Create directories
PROTEIN_DIR.mkdir(parents=True, exist_ok=True)
EXPERIMENT_DIR.mkdir(parents=True, exist_ok=True)
LOG_DIR.mkdir(parents=True, exist_ok=True)

# Log file
LOG_FILE = LOG_DIR / "download_log.txt"

# Request settings
TIMEOUT = 60  # seconds
RETRY_ATTEMPTS = 3
DELAY_BETWEEN_REQUESTS = 1  # seconds (be polite to server)

print(f"✓ Configuration set")
print(f"  Protein sequences → {PROTEIN_DIR}")
print(f"  Experiment metadata → {EXPERIMENT_DIR}")
print(f"  Logs → {LOG_FILE}")

✓ Configuration set
  Protein sequences → ../data/raw/protein_sequences
  Experiment metadata → ../data/raw/experiment_metadata
  Logs → ../data/raw/logs/download_log.txt


## Organism List

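Organism IDs to download from the Fitness Browser database. The list below contains:
- 32 organisms from BigFit (Nature 2018)
- 9 additional IDs from later releases
- Total in this list: 41 IDs. The full database is stated to hold 48 organisms (46 bacteria + 2 archaea), so this list is not yet complete.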

**Note**: If you want to download for specific organisms only, modify the `ORGANISMS` list below.

In [4]:
# Known organisms from BigFit and additional releases
ORGANISMS = [
    # BigFit organisms (Nature 2018)
    'acidovorax_3H11',
    'ANA3',
    'azobra',
    'BFirm',
    'Caulo',
    'Cola',
    'Cup4G11',
    'Dyella79',
    'Dino',
    'HerbieS',
    'Kang',
    'Keio',
    'Korea',
    'Koxy',
    'Marino',
    'Miya',
    'MR1',
    'Phaeo',
    'Ponti',
    'PS',
    'pseudo1_N1B4',
    'pseudo3_N2E3',
    'pseudo5_N2C3_1',
    'pseudo6_N2E2',
    'pseudo13_GW456_L13',
    'psRCH2',
    'Pedo557',
    'PV4',
    'SB2B',
    'Smeli',
    'SynE',
    'WCS417',
    # Additional organisms (add more as discovered)
    'Putida',
    'Dtox',
    'Echoli',
    'Halo',
    'Marino2',
    'Pputida_KT2440',
    'Shewanella_ANA3',
    'Sulcia',
    'Thermus',
    # Add more organism IDs here if needed
]

print(f"✓ Loaded {len(ORGANISMS)} organism IDs")
print(f"\nFirst 10 organisms:")
for i, org in enumerate(ORGANISMS[:10], 1):
    print(f"  {i}. {org}")

✓ Loaded 41 organism IDs

First 10 organisms:
  1. acidovorax_3H11
  2. ANA3
  3. azobra
  4. BFirm
  5. Caulo
  6. Cola
  7. Cup4G11
  8. Dyella79
  9. Dino
  10. HerbieS
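
To restrict a run to a subset, as the note above suggests, one option is to filter the list instead of editing it by hand. The following is a minimal sketch; the helper `select_organisms` and the shortened stand-in list are hypothetical and not part of this notebook.

```python
def select_organisms(all_ids, wanted):
    """Split `wanted` into IDs present in `all_ids` and IDs that are not."""
    known = set(all_ids)
    selected = [org for org in wanted if org in known]
    unknown = [org for org in wanted if org not in known]
    return selected, unknown


# Shortened stand-in for the full ORGANISMS list
example_ids = ['acidovorax_3H11', 'ANA3', 'Keio', 'MR1']

# 'NotAnOrg' is a deliberately invalid ID to show how typos surface
selected, unknown = select_organisms(example_ids, ['Keio', 'MR1', 'NotAnOrg'])
print("Will download:", selected)
print("Not in list:", unknown)
```

Reporting unknown IDs separately makes a misspelled organism ID visible before any request is sent.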


## Helper Functions

In [5]:
# Create cloudscraper session (bypasses Cloudflare protection)
scraper = cloudscraper.create_scraper(
    browser={
        'browser': 'chrome',
        'platform': 'darwin',
        'desktop': True
    }
)

def log_message(message, also_print=True):
    """Write message to log file with timestamp"""
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    log_entry = f"[{timestamp}] {message}\n"
    
    with open(LOG_FILE, 'a') as f:
        f.write(log_entry)
    
    if also_print:
        print(message)


def download_file(url, output_path, description, max_retries=RETRY_ATTEMPTS):
    """
    Download file from URL with retry logic
    Uses cloudscraper to bypass Cloudflare protection
    
    Returns:
        tuple: (success: bool, file_size: int, error_message: str)
    """
    for attempt in range(1, max_retries + 1):
        try:
            response = scraper.get(url, timeout=TIMEOUT)
            response.raise_for_status()
            
            # Reject very short responses that mention an error (a full-size HTML error page is not caught here)
            if len(response.content) < 100 and b'error' in response.content.lower():
                return False, 0, "Server returned error page"
            
            # Check if we got a challenge page (shouldn't happen with cloudscraper)
            if b'challenge' in response.content.lower() and len(response.content) < 10000:
                return False, 0, "Cloudflare challenge not bypassed"
            
            # Save file
            with open(output_path, 'wb') as f:
                f.write(response.content)
            
            file_size = len(response.content)
            return True, file_size, None
            
        except Exception as e:
            if attempt < max_retries:
                log_message(f"  Error on attempt {attempt}/{max_retries}: {str(e)}", also_print=False)
                time.sleep(2)
            else:
                return False, 0, str(e)
    
    return False, 0, "Unknown error"


def format_size(size_bytes):
    """Format bytes to human-readable size"""
    for unit in ['B', 'KB', 'MB', 'GB']:
        if size_bytes < 1024.0:
            return f"{size_bytes:.1f} {unit}"
        size_bytes /= 1024.0
    return f"{size_bytes:.1f} TB"


print("✓ Helper functions defined")
print("✓ Cloudscraper session created (bypasses Cloudflare protection)")

✓ Helper functions defined
✓ Cloudscraper session created (bypasses Cloudflare protection)
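
`download_file` waits a fixed 2 seconds between attempts. A common variation is exponential backoff, where the wait doubles after each failure. The sketch below illustrates that alternative in isolation; `retry_call` and its parameters are hypothetical names, it is not the notebook's implementation, and it makes no network requests.

```python
import time


def retry_call(func, max_retries=3, base_delay=2.0, sleep=time.sleep):
    """Call func() until it succeeds or max_retries attempts are used.

    Returns (success, result, error_message), in the same spirit as
    the tuple returned by download_file.
    """
    last_error = "Unknown error"
    for attempt in range(1, max_retries + 1):
        try:
            return True, func(), None
        except Exception as exc:
            last_error = str(exc)
            if attempt < max_retries:
                # Delay doubles each time: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** (attempt - 1))
    return False, None, last_error


# Demo: a function that fails twice, then succeeds
calls = {'n': 0}

def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

# Record the delays instead of actually sleeping, so the demo runs instantly
delays = []
outcome = retry_call(flaky, sleep=delays.append)
print(outcome, delays)
```

Passing the sleep function as a parameter keeps the retry logic testable without real waiting.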


## Download Functions

In [6]:
def download_protein_sequences(org_id):
    """
    Download protein sequences for an organism
    
    Args:
        org_id: Organism identifier (e.g., 'Keio', 'acidovorax_3H11')
    
    Returns:
        tuple: (success: bool, file_path: Path, file_size: int, error: str)
    """
    url = f"{PROTEIN_URL}?orgId={org_id}"
    output_file = PROTEIN_DIR / f"{org_id}_proteins.fasta"
    
    success, size, error = download_file(url, output_file, "protein sequences")
    
    if success:
        log_message(f"  ✓ Protein sequences: {format_size(size)}", also_print=False)
    else:
        log_message(f"  ✗ Protein sequences failed: {error}", also_print=False)
    
    return success, output_file, size, error


def download_experiment_metadata(org_id):
    """
    Download experiment metadata for an organism
    
    Args:
        org_id: Organism identifier
    
    Returns:
        tuple: (success: bool, file_path: Path, file_size: int, error: str)
    """
    url = f"{EXPERIMENT_URL}?orgId={org_id}"
    output_file = EXPERIMENT_DIR / f"{org_id}_experiments.tsv"
    
    success, size, error = download_file(url, output_file, "experiment metadata")
    
    if success:
        log_message(f"  ✓ Experiment metadata: {format_size(size)}", also_print=False)
    else:
        log_message(f"  ✗ Experiment metadata failed: {error}", also_print=False)
    
    return success, output_file, size, error


def download_organism_data(org_id):
    """
    Download both protein sequences and experiment metadata for an organism
    
    Args:
        org_id: Organism identifier
    
    Returns:
        dict: Download results
    """
    log_message(f"Downloading data for {org_id}...")
    
    results = {
        'org_id': org_id,
        'protein_success': False,
        'protein_size': 0,
        'protein_error': None,
        'experiment_success': False,
        'experiment_size': 0,
        'experiment_error': None,
    }
    
    # Download protein sequences
    success, path, size, error = download_protein_sequences(org_id)
    results['protein_success'] = success
    results['protein_size'] = size
    results['protein_error'] = error
    results['protein_path'] = str(path)
    
    # Small delay between requests
    time.sleep(DELAY_BETWEEN_REQUESTS)
    
    # Download experiment metadata
    success, path, size, error = download_experiment_metadata(org_id)
    results['experiment_success'] = success
    results['experiment_size'] = size
    results['experiment_error'] = error
    results['experiment_path'] = str(path)
    
    # Summary
    if results['protein_success'] and results['experiment_success']:
        log_message(f"✓ {org_id} complete")
    elif results['protein_success'] or results['experiment_success']:
        log_message(f"⚠ {org_id} partially complete")
    else:
        log_message(f"✗ {org_id} failed")
    
    return results


print("✓ Download functions defined")

✓ Download functions defined


## Test Download (Single Organism)

Test with one organism before running the full download.

In [7]:
# Test with first organism
test_org = ORGANISMS[0]
print(f"Testing download for: {test_org}")
print("="*50)

test_result = download_organism_data(test_org)

print("\nTest Results:")
print(f"  Protein sequences: {'✓' if test_result['protein_success'] else '✗'}")
if test_result['protein_success']:
    print(f"    Size: {format_size(test_result['protein_size'])}")
    print(f"    Path: {test_result['protein_path']}")
else:
    print(f"    Error: {test_result['protein_error']}")

print(f"  Experiment metadata: {'✓' if test_result['experiment_success'] else '✗'}")
if test_result['experiment_success']:
    print(f"    Size: {format_size(test_result['experiment_size'])}")
    print(f"    Path: {test_result['experiment_path']}")
else:
    print(f"    Error: {test_result['experiment_error']}")

if test_result['protein_success'] and test_result['experiment_success']:
    print("\n✓ Test successful! Ready to download all organisms.")
else:
    print("\n⚠ Test had issues. Check errors above before proceeding.")

Testing download for: acidovorax_3H11
Downloading data for acidovorax_3H11...
✓ acidovorax_3H11 complete

Test Results:
  Protein sequences: ✓
    Size: 1.8 MB
    Path: ../data/raw/protein_sequences/acidovorax_3H11_proteins.fasta
  Experiment metadata: ✓
    Size: 68.1 KB
    Path: ../data/raw/experiment_metadata/acidovorax_3H11_experiments.tsv

✓ Test successful! Ready to download all organisms.


## Download All Organisms

**⚠️ Warning**: This will download data for all organisms. Depending on file sizes and connection speed, this may take 30-60 minutes.

**Before running**:
- Make sure test download worked
- Check you have sufficient disk space
- Ensure stable internet connection

**To run**: Execute the cell below.

In [8]:
# Initialize log
log_message("="*70)
log_message(f"Starting download for {len(ORGANISMS)} organisms")
log_message("="*70)

# Track results
all_results = []
total_protein_size = 0
total_experiment_size = 0
successful_proteins = 0
successful_experiments = 0

# Download with progress bar
for org_id in tqdm(ORGANISMS, desc="Downloading organisms"):
    result = download_organism_data(org_id)
    all_results.append(result)
    
    if result['protein_success']:
        successful_proteins += 1
        total_protein_size += result['protein_size']
    
    if result['experiment_success']:
        successful_experiments += 1
        total_experiment_size += result['experiment_size']
    
    # Delay between organisms
    time.sleep(DELAY_BETWEEN_REQUESTS)

# Summary
print("\n" + "="*70)
print("DOWNLOAD COMPLETE")
print("="*70)
print(f"\nProtein Sequences:")
print(f"  Successful: {successful_proteins}/{len(ORGANISMS)}")
print(f"  Total size: {format_size(total_protein_size)}")
print(f"\nExperiment Metadata:")
print(f"  Successful: {successful_experiments}/{len(ORGANISMS)}")
print(f"  Total size: {format_size(total_experiment_size)}")
print(f"\nTotal downloaded: {format_size(total_protein_size + total_experiment_size)}")

log_message("="*70)
log_message(f"Download complete: {successful_proteins} proteins, {successful_experiments} experiments")
log_message(f"Total size: {format_size(total_protein_size + total_experiment_size)}")
log_message("="*70)

Starting download for 41 organisms


Downloading organisms:   0%|          | 0/41 [00:00<?, ?it/s]

Downloading data for acidovorax_3H11...
✓ acidovorax_3H11 complete
Downloading data for ANA3...
✓ ANA3 complete
Downloading data for azobra...
✓ azobra complete
Downloading data for BFirm...
✓ BFirm complete
Downloading data for Caulo...
✓ Caulo complete
Downloading data for Cola...
✓ Cola complete
Downloading data for Cup4G11...
✓ Cup4G11 complete
Downloading data for Dyella79...
✓ Dyella79 complete
Downloading data for Dino...
✓ Dino complete
Downloading data for HerbieS...
✓ HerbieS complete
Downloading data for Kang...
✓ Kang complete
Downloading data for Keio...
✓ Keio complete
Downloading data for Korea...
✓ Korea complete
Downloading data for Koxy...
✓ Koxy complete
Downloading data for Marino...
✓ Marino complete
Downloading data for Miya...
✓ Miya complete
Downloading data for MR1...
✓ MR1 complete
Downloading data for Phaeo...
✓ Phaeo complete
Downloading data for Ponti...
✓ Ponti complete
Downloading data for PS...
✓ PS complete
Downloading data for pseudo1_N1B4...
✓ pseudo1

## Save Download Summary

In [9]:
# Create summary DataFrame
summary_df = pd.DataFrame(all_results)

# Add human-readable sizes
summary_df['protein_size_readable'] = summary_df['protein_size'].apply(format_size)
summary_df['experiment_size_readable'] = summary_df['experiment_size'].apply(format_size)

# Reorder columns
summary_df = summary_df[[
    'org_id',
    'protein_success', 'protein_size', 'protein_size_readable', 'protein_path', 'protein_error',
    'experiment_success', 'experiment_size', 'experiment_size_readable', 'experiment_path', 'experiment_error'
]]

# Save summary
summary_file = LOG_DIR / f"download_summary_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
summary_df.to_csv(summary_file, index=False)

print(f"✓ Summary saved to: {summary_file}")
print(f"\nFirst few rows:")
display(summary_df.head(10))

✓ Summary saved to: ../data/raw/logs/download_summary_20251001_015504.csv

First few rows:


| | org_id | protein_success | protein_size | protein_size_readable | protein_path | protein_error | experiment_success | experiment_size | experiment_size_readable | experiment_path | experiment_error |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | acidovorax_3H11 | True | 1932728 | 1.8 MB | ../data/raw/protein_sequences/acidovorax_3H11_... | | True | 69763 | 68.1 KB | ../data/raw/experiment_metadata/acidovorax_3H1... | |
| 1 | ANA3 | True | 1817199 | 1.7 MB | ../data/raw/protein_sequences/ANA3_proteins.fasta | | True | 48177 | 47.0 KB | ../data/raw/experiment_metadata/ANA3_experimen... | |
| 2 | azobra | True | 2056699 | 2.0 MB | ../data/raw/protein_sequences/azobra_proteins.... | | True | 43012 | 42.0 KB | ../data/raw/experiment_metadata/azobra_experim... | |
| 3 | BFirm | True | 2745855 | 2.6 MB | ../data/raw/protein_sequences/BFirm_proteins.f... | | True | 59718 | 58.3 KB | ../data/raw/experiment_metadata/BFirm_experime... | |
| 4 | Caulo | True | 1458861 | 1.4 MB | ../data/raw/protein_sequences/Caulo_proteins.f... | | True | 94272 | 92.1 KB | ../data/raw/experiment_metadata/Caulo_experime... | |
| 5 | Cola | True | 1925815 | 1.8 MB | ../data/raw/protein_sequences/Cola_proteins.fasta | | True | 100823 | 98.5 KB | ../data/raw/experiment_metadata/Cola_experimen... | |
| 6 | Cup4G11 | True | 2805511 | 2.7 MB | ../data/raw/protein_sequences/Cup4G11_proteins... | | True | 66924 | 65.4 KB | ../data/raw/experiment_metadata/Cup4G11_experi... | |
| 7 | Dyella79 | True | 1802814 | 1.7 MB | ../data/raw/protein_sequences/Dyella79_protein... | | True | 33965 | 33.2 KB | ../data/raw/experiment_metadata/Dyella79_exper... | |
| 8 | Dino | True | 1617503 | 1.5 MB | ../data/raw/protein_sequences/Dino_proteins.fasta | | True | 88832 | 86.8 KB | ../data/raw/experiment_metadata/Dino_experimen... | |
| 9 | HerbieS | True | 1890504 | 1.8 MB | ../data/raw/protein_sequences/HerbieS_proteins... | | True | 47134 | 46.0 KB | ../data/raw/experiment_metadata/HerbieS_experi... | |


## Check Downloaded Files

In [10]:
# Count files
protein_files = list(PROTEIN_DIR.glob("*.fasta"))
experiment_files = list(EXPERIMENT_DIR.glob("*.tsv"))

print(f"Files in {PROTEIN_DIR}:")
print(f"  {len(protein_files)} protein sequence files")
print(f"\nFiles in {EXPERIMENT_DIR}:")
print(f"  {len(experiment_files)} experiment metadata files")

# Show some examples
print(f"\nExample protein sequence files:")
for f in protein_files[:5]:
    size = f.stat().st_size
    print(f"  {f.name}: {format_size(size)}")

print(f"\nExample experiment metadata files:")
for f in experiment_files[:5]:
    size = f.stat().st_size
    print(f"  {f.name}: {format_size(size)}")

Files in ../data/raw/protein_sequences:
  41 protein sequence files

Files in ../data/raw/experiment_metadata:
  33 experiment metadata files

Example protein sequence files:
  Halo_proteins.fasta: 1.5 KB
  Pedo557_proteins.fasta: 2.0 MB
  WCS417_proteins.fasta: 2.0 MB
  PS_proteins.fasta: 1.3 MB
  SB2B_proteins.fasta: 1.4 MB

Example experiment metadata files:
  Marino_experiments.tsv: 122.4 KB
  PV4_experiments.tsv: 69.2 KB
  acidovorax_3H11_experiments.tsv: 68.1 KB
  Keio_experiments.tsv: 74.1 KB
  Phaeo_experiments.tsv: 127.4 KB


## Quick Data Inspection

Preview downloaded files to confirm they're valid

In [11]:
# Check protein sequences
if protein_files:
    test_fasta = protein_files[0]
    print(f"Inspecting: {test_fasta.name}")
    print("="*70)
    with open(test_fasta, 'r') as f:
        lines = f.readlines()[:20]  # First 20 lines
        print(''.join(lines))
    
    # Count sequences
    with open(test_fasta, 'r') as f:
        n_seqs = sum(1 for line in f if line.startswith('>'))
    print(f"\nTotal sequences in file: {n_seqs}")

Inspecting: Halo_proteins.fasta
    <!DOCTYPE html>
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
    <head>
    <meta http-equiv="X-UA-Compatible" content="IE=edge"> 
    <link rev="made" href="mailto:morgannprice%40yahoo.com" />
    <title>Sorry!</title>
    <meta name="copyright" content="copyright 2015 UC Berkeley" />
    <link rel="shortcut icon" href="../images/favicon.ico" type="image/x-icon">
    <link rel="icon" href="../images/favicon.ico" type="image/x-icon">
    <link rel="stylesheet" href="../images/feba2.css">
    <link href='http://fonts.googleapis.com/css?family=Montserrat:700' rel='stylesheet' type='text/css'>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
        </head>
    <body>
    <div id="page">
    <div id="nav"> <div class="box">
        <li class="header"><A style="color: gold;" TITLE="Fitness Browser: fitness data from the Deutschbauer lab, the Arkin lab, and collaborators" HREF="myFrontPage.cgi
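
The preview above shows that `Halo_proteins.fasta` holds the server's HTML "Sorry!" page, not protein sequences, even though the download was recorded as a success. A lightweight content check can flag such files. The sketch below is one possible heuristic, not part of the notebook; `looks_like_fasta` is a hypothetical helper and the sample sequence is made up for the demo.

```python
from pathlib import Path
import tempfile


def looks_like_fasta(path, probe_bytes=1000):
    """Heuristic: no HTML markers near the top, and the first
    non-blank content starts with a FASTA header character '>'."""
    with open(path, 'rb') as f:
        head = f.read(probe_bytes)
    lowered = head.lower()
    if b'<!doctype html' in lowered or b'<html' in lowered:
        return False
    return head.lstrip().startswith(b'>')


# Demo on two temporary files: one FASTA-like, one HTML error page
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good_proteins.fasta"
    bad = Path(tmp) / "bad_proteins.fasta"
    good.write_text(">seq1 made-up example\nMKTAYIAKQR\n")
    bad.write_text("<!DOCTYPE html>\n<html><title>Sorry!</title></html>\n")
    good_ok = looks_like_fasta(good)
    bad_ok = looks_like_fasta(bad)

print(good_ok, bad_ok)
```

Running a check like this over every file in `PROTEIN_DIR` would separate real sequence files from saved error pages before any downstream analysis.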

In [12]:
# Check experiment metadata
if experiment_files:
    test_exp = experiment_files[0]
    print(f"Inspecting: {test_exp.name}")
    print("="*70)
    
    # Read with pandas
    exp_df = pd.read_csv(test_exp, sep='\t')
    print(f"Shape: {exp_df.shape[0]} rows × {exp_df.shape[1]} columns")
    print(f"\nColumns:")
    print(exp_df.columns.tolist())
    print(f"\nFirst few rows:")
    display(exp_df.head())

Inspecting: Marino_experiments.tsv
Shape: 255 rows × 48 columns

Columns:
['orgId', 'expName', 'expDesc', 'timeZeroSet', 'num', 'nMapped', 'nPastEnd', 'nGenic', 'nUsed', 'gMed', 'gMedt0', 'gMean', 'cor12', 'mad12', 'mad12c', 'mad12c_t0', 'opcor', 'adjcor', 'gccor', 'maxFit', 'expGroup', 'expDescLong', 'mutantLibrary', 'person', 'dateStarted', 'setName', 'seqindex', 'media', 'temperature', 'pH', 'vessel', 'aerobic', 'liquid', 'shaking', 'condition_1', 'units_1', 'concentration_1', 'condition_2', 'units_2', 'concentration_2', 'condition_3', 'units_3', 'concentration_3', 'condition_4', 'units_4', 'concentration_4', 'growthPlate', 'growthWells']

First few rows:


| | orgId | expName | expDesc | timeZeroSet | num | nMapped | nPastEnd | nGenic | nUsed | gMed | ... | units_2 | concentration_2 | condition_3 | units_3 | concentration_3 | condition_4 | units_4 | concentration_4 | growthPlate | growthWells |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Marino | set1H10 | pyruvate | 9/5/2013 Marino_ML2_set1 | 10 | 6324120 | 1486 | 4546027 | 4544419 | 864.0 | ... | | | | | | | | | 675.0 | B7,B8 |
| 1 | Marino | set1H11 | succinate | 9/5/2013 Marino_ML2_set1 | 11 | 5380471 | 1521 | 3870337 | 3869153 | 678.0 | ... | | | | | | | | | 675.0 | C1,C2 |
| 2 | Marino | set1H12 | L-Malic | 9/5/2013 Marino_ML2_set1 | 12 | 5713482 | 1270 | 4106509 | 4105065 | 747.0 | ... | | | | | | | | | 675.0 | C3,C4 |
| 3 | Marino | set1H13 | L-Glutamic | 9/5/2013 Marino_ML2_set1 | 13 | 4061138 | 1148 | 2919694 | 2918795 | 505.0 | ... | | | | | | | | | 675.0 | C5,C6 |
| 4 | Marino | set1H15 | L-Proline | 9/5/2013 Marino_ML2_set1 | 15 | 2797419 | 661 | 2008720 | 2008054 | 367.5 | ... | | | | | | | | | 675.0 | D1,D2 |
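
A similar sanity check applies to the experiment metadata: a valid file should start with a tab-delimited header containing columns such as `orgId`, `expName`, and `expDesc` (all taken from the column list above). The sketch below uses only the standard library; `missing_columns` and the choice of required columns are assumptions made for illustration.

```python
import csv
import io

# Columns taken from the header shown above; which ones to require is an assumption
REQUIRED_COLUMNS = ('orgId', 'expName', 'expDesc')


def missing_columns(handle, required=REQUIRED_COLUMNS):
    """Return the required column names absent from a tab-delimited header row."""
    reader = csv.reader(handle, delimiter='\t')
    header = next(reader, [])
    return sorted(set(required) - set(header))


# Demo with in-memory text instead of real files
valid_tsv = io.StringIO("orgId\texpName\texpDesc\nMarino\tset1H10\tpyruvate\n")
html_page = io.StringIO("<!DOCTYPE html>\n<html>\n")

valid_missing = missing_columns(valid_tsv)
html_missing = missing_columns(html_page)
print(valid_missing)
print(html_missing)
```

An empty result means the header looks as expected; a non-empty result suggests the file is not the expected metadata table.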


## Failed Downloads

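Check which organisms (if any) failed to download. This reflects only the status recorded by `download_file`: a file that was saved but contains an HTML error page (such as `Halo_proteins.fasta` above) still counts as a success here.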

In [13]:
# Find failed downloads
failed_proteins = summary_df[~summary_df['protein_success']]
failed_experiments = summary_df[~summary_df['experiment_success']]

if len(failed_proteins) > 0:
    print(f"⚠ {len(failed_proteins)} organisms failed protein download:")
    display(failed_proteins[['org_id', 'protein_error']])
else:
    print("✓ All protein sequences downloaded successfully")

print()

if len(failed_experiments) > 0:
    print(f"⚠ {len(failed_experiments)} organisms failed experiment download:")
    display(failed_experiments[['org_id', 'experiment_error']])
else:
    print("✓ All experiment metadata downloaded successfully")

✓ All protein sequences downloaded successfully

⚠ 8 organisms failed experiment download:


| | org_id | experiment_error |
|---|---|---|
| 33 | Dtox | 500 Server Error: Internal Server Error for ur... |
| 34 | Echoli | 500 Server Error: Internal Server Error for ur... |
| 35 | Halo | 500 Server Error: Internal Server Error for ur... |
| 36 | Marino2 | 500 Server Error: Internal Server Error for ur... |
| 37 | Pputida_KT2440 | 500 Server Error: Internal Server Error for ur... |
| 38 | Shewanella_ANA3 | 500 Server Error: Internal Server Error for ur... |
| 39 | Sulcia | 500 Server Error: Internal Server Error for ur... |
| 40 | Thermus | 500 Server Error: Internal Server Error for ur... |


## Retry Failed Downloads (Optional)

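If some downloads failed, you can retry them by uncommenting the cell below. Retrying helps with transient errors. The eight failures above all returned HTTP 500 for IDs in the "additional organisms" group, which may mean those IDs are not valid for `createExpData.cgi`; in that case a retry will fail again.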

In [14]:
# Uncomment to retry failed downloads

# failed_orgs = set(failed_proteins['org_id'].tolist() + failed_experiments['org_id'].tolist())

# if failed_orgs:
#     print(f"Retrying {len(failed_orgs)} failed organisms...")
#     
#     for org_id in tqdm(failed_orgs, desc="Retrying"):
#         result = download_organism_data(org_id)
#         time.sleep(DELAY_BETWEEN_REQUESTS)
#     
#     print("✓ Retry complete")
# else:
#     print("No failed downloads to retry")

## Summary Statistics

In [15]:
# Calculate statistics
print("DOWNLOAD SUMMARY")
print("="*70)
print(f"\nOrganisms processed: {len(ORGANISMS)}")
print(f"\nProtein Sequences:")
print(f"  Downloaded: {summary_df['protein_success'].sum()} / {len(ORGANISMS)}")
print(f"  Success rate: {100 * summary_df['protein_success'].mean():.1f}%")
print(f"  Total size: {format_size(summary_df['protein_size'].sum())}")
print(f"  Average size: {format_size(summary_df['protein_size'].mean())}")

print(f"\nExperiment Metadata:")
print(f"  Downloaded: {summary_df['experiment_success'].sum()} / {len(ORGANISMS)}")
print(f"  Success rate: {100 * summary_df['experiment_success'].mean():.1f}%")
print(f"  Total size: {format_size(summary_df['experiment_size'].sum())}")
print(f"  Average size: {format_size(summary_df['experiment_size'].mean())}")

print(f"\nTotal Downloaded: {format_size(summary_df['protein_size'].sum() + summary_df['experiment_size'].sum())}")

both_success = (summary_df['protein_success'] & summary_df['experiment_success']).sum()
print(f"\nComplete downloads (both files): {both_success} / {len(ORGANISMS)}")
print(f"Complete success rate: {100 * both_success / len(ORGANISMS):.1f}%")

DOWNLOAD SUMMARY

Organisms processed: 41

Protein Sequences:
  Downloaded: 41 / 41
  Success rate: 100.0%
  Total size: 58.6 MB
  Average size: 1.4 MB

Experiment Metadata:
  Downloaded: 33 / 41
  Success rate: 80.5%
  Total size: 2.5 MB
  Average size: 63.1 KB

Total Downloaded: 61.1 MB

Complete downloads (both files): 33 / 41
Complete success rate: 80.5%


## Next Steps

After running this notebook:

1. **Check the logs**: Review `data/raw/logs/download_log.txt` for any issues
2. **Verify file counts**: Ensure you have files for all organisms
3. **Inspect data**: Open a few FASTA and TSV files to confirm they're valid
4. **Continue analysis**: Use the downloaded data in subsequent notebooks
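
The file-count check in step 2 can be automated by comparing the organism list against what is on disk. The sketch below assumes the notebook's naming scheme (`<org_id>_proteins.fasta`, `<org_id>_experiments.tsv`); `find_missing` is a hypothetical helper, and the demo uses a temporary directory instead of the real data folders.

```python
from pathlib import Path
import tempfile


def find_missing(org_ids, directory, suffix):
    """Return org IDs with no non-empty file named '<org_id><suffix>' in directory."""
    directory = Path(directory)
    missing = []
    for org in org_ids:
        candidate = directory / f"{org}{suffix}"
        if not candidate.is_file() or candidate.stat().st_size == 0:
            missing.append(org)
    return missing


# Demo: one real file, one empty file, one absent file
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "Keio_proteins.fasta").write_text(">seq1 made-up example\nMKT\n")
    (tmp_dir / "MR1_proteins.fasta").write_text("")
    missing = find_missing(['Keio', 'MR1', 'ANA3'], tmp_dir, "_proteins.fasta")

print(missing)
```

In the notebook, the same call could be made with `ORGANISMS`, `PROTEIN_DIR`, and `EXPERIMENT_DIR` to list organisms that still need attention.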

**Files created**:
- `data/raw/protein_sequences/*.fasta` - Protein sequences (one per organism)
- `data/raw/experiment_metadata/*.tsv` - Experiment metadata (one per organism)
- `data/raw/logs/download_log.txt` - Detailed download log
- `data/raw/logs/download_summary_*.csv` - Summary CSV with results