# Extract Organism Metadata from Fitness Browser Database

**Purpose**: Create a comprehensive organism metadata table for cross-referencing organism identifiers, species names, and NCBI taxonomy IDs

**Data Source**: Fitness Browser SQLite database (feba.db)
- Database: https://figshare.com/articles/dataset/25236931
- Contains metadata for 57 organisms

**Output File**:
- `results/organism_metadata.csv` - Complete organism reference table

**Columns in output**:
- `orgId` - Short identifier (e.g., "Keio", "Cola")
- `Full_Species_Name` - With division prefix (e.g., "Gammaproteobacteria: Escherichia coli BW25113")
- `NCBI_TaxID` - NCBI taxonomy identifier
- `division` - Taxonomic division
- `genus` - Genus name
- `species` - Species name
- `strain` - Strain identifier
- `Species_Name` - Simple format (e.g., "Escherichia coli BW25113")

**Last updated**: 2025-10-06

## Setup

In [None]:
import sqlite3
import pandas as pd
from pathlib import Path

print("Imports successful")

## Configuration

In [None]:
# Paths
DB_PATH = Path("../downloads/feba.db")
OUTPUT_DIR = Path("results")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

OUTPUT_FILE = OUTPUT_DIR / "organism_metadata.csv"

print(f"Configuration set")
print(f"  Database: {DB_PATH}")
print(f"  Output: {OUTPUT_FILE}")

# Check that the database exists before connecting: sqlite3.connect()
# would otherwise silently create an empty file at DB_PATH
if not DB_PATH.exists():
    raise FileNotFoundError(
        f"Database not found at {DB_PATH}. Please download feba.db from: "
        "https://figshare.com/articles/dataset/25236931"
    )

print(f"\nDatabase found: {DB_PATH.stat().st_size / 1e9:.1f} GB")

## Connect to Database

In [None]:
print(f"Connecting to database: {DB_PATH}")
conn = sqlite3.connect(str(DB_PATH))
print("Connected successfully")
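
Since this notebook only reads from the database, one option is to open it in read-only mode. The sketch below uses SQLite's URI syntax with `mode=ro`; the helper name `connect_read_only` is hypothetical and not part of the original notebook.

```python
import sqlite3
from pathlib import Path


def connect_read_only(db_path):
    """Open an existing SQLite file read-only (hypothetical helper)."""
    # as_uri() needs an absolute path; mode=ro makes SQLite refuse to
    # create the file or to write to it
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

With this approach a missing file raises `sqlite3.OperationalError` at connect time instead of producing an empty database, and accidental writes fail in the same way.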

## Extract Organism Metadata

In [None]:
# Query organism table
query = """
SELECT 
    orgId,
    division,
    genus,
    species,
    strain,
    CAST(ncbiTaxonId AS INTEGER) as NCBI_TaxID
FROM Organism
ORDER BY orgId
"""

print("Extracting organism metadata...")
organisms = pd.read_sql_query(query, conn)

print(f"\nExtracted {len(organisms)} organisms")
print(f"\nFirst 5 organisms:")
print(organisms.head())
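
If the query fails with a "no such column" error, it may help to inspect the table schema first. The sketch below lists column names with `PRAGMA table_info`; the helper `list_columns` is hypothetical, and the in-memory table is a stand-in whose columns are copied from the query above, not read from the real feba.db.

```python
import sqlite3


def list_columns(conn, table):
    """Return the column names of `table` (hypothetical helper)."""
    # The table name is interpolated into the statement here,
    # so only pass trusted table names
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk)
    return [row[1] for row in rows]


# Toy in-memory table standing in for feba.db (assumed columns)
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE Organism (orgId TEXT, division TEXT, genus TEXT, "
    "species TEXT, strain TEXT, ncbiTaxonId TEXT)"
)
expected = {"orgId", "division", "genus", "species", "strain", "ncbiTaxonId"}
missing = expected - set(list_columns(demo, "Organism"))
print(f"Missing columns: {sorted(missing)}")
demo.close()
```

Against the real database, the same call would be `list_columns(conn, "Organism")`.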

## Create Species Name Formats

In [None]:
# Create Full_Species_Name (with division prefix)
organisms['Full_Species_Name'] = (
    organisms['division'] + ': ' +
    organisms['genus'] + ' ' +
    organisms['species'] + ' ' +
    organisms['strain']
).str.strip()

# Create Species_Name (simple format, no division)
organisms['Species_Name'] = (
    organisms['genus'] + ' ' +
    organisms['species'] + ' ' +
    organisms['strain']
).str.strip()

print("Created species name formats")
print(f"\nExample formats for {organisms.iloc[0]['orgId']}:")
print(f"  orgId: {organisms.iloc[0]['orgId']}")
print(f"  Full_Species_Name: {organisms.iloc[0]['Full_Species_Name']}")
print(f"  Species_Name: {organisms.iloc[0]['Species_Name']}")
print(f"  NCBI_TaxID: {organisms.iloc[0]['NCBI_TaxID']}")
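
Note that `+` concatenation yields `NaN` for the whole name if any part is missing, and an empty middle part leaves a double space. A minimal sketch of a more defensive join follows, assuming missing parts should simply be skipped; `join_name_parts` is a hypothetical helper and the second demo row is made up for illustration.

```python
import pandas as pd


def join_name_parts(df, columns):
    """Join text columns with single spaces, skipping missing or empty parts."""
    parts = df[columns].fillna("").astype(str)
    joined = parts.agg(" ".join, axis=1)
    # Collapse the runs of spaces left by empty parts, then trim the ends
    return joined.str.replace(r"\s+", " ", regex=True).str.strip()


demo = pd.DataFrame({
    "genus": ["Escherichia", "Examplea"],   # second row is invented
    "species": ["coli", "testensis"],
    "strain": ["BW25113", None],            # missing strain
})
demo["Species_Name"] = join_name_parts(demo, ["genus", "species", "strain"])
print(demo["Species_Name"].tolist())
```

The same helper could build `Full_Species_Name` by prefixing `division` and `": "` to the joined result.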

## Reorder Columns

In [None]:
# Reorder columns for better readability
column_order = [
    'orgId',
    'Full_Species_Name',
    'NCBI_TaxID',
    'division',
    'genus',
    'species',
    'strain',
    'Species_Name'
]

organisms = organisms[column_order]

print("Reordered columns:")
for col in organisms.columns:
    print(f"  - {col}")

## Display All Organisms

In [None]:
print("All organisms in Fitness Browser database:")
print("="*70)

for _, row in organisms.iterrows():
    print(f"{row['orgId']:30s} {row['Species_Name']}")

print("="*70)
print(f"Total: {len(organisms)} organisms")

## Save Metadata Table

In [None]:
print(f"\nSaving organism metadata to: {OUTPUT_FILE}")
organisms.to_csv(OUTPUT_FILE, index=False)
print(f"Saved {len(organisms)} organisms with {len(organisms.columns)} columns")

print(f"\nColumns saved:")
for col in organisms.columns:
    print(f"  - {col}")
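
A quick way to confirm the file was written as intended is to read it back and compare it with the in-memory table. The sketch below does this on a one-row stand-in table in a temporary directory; reading everything as text is an assumption that suits identifier columns, not a requirement of the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# One-row stand-in for the real metadata table
table = pd.DataFrame({
    "orgId": ["Keio"],
    "Species_Name": ["Escherichia coli BW25113"],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "organism_metadata.csv"
    table.to_csv(csv_path, index=False)
    # dtype=str keeps identifiers as text; keep_default_na=False stops
    # strings such as "NA" from being read back as missing values
    reloaded = pd.read_csv(csv_path, dtype=str, keep_default_na=False)

round_trip_ok = reloaded.to_dict("records") == table.to_dict("records")
print(f"Round trip OK: {round_trip_ok}")
```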

## Close Database Connection

In [None]:
conn.close()
print("Database connection closed")

## Summary

In [None]:
print("="*70)
print("ORGANISM METADATA EXTRACTION COMPLETE")
print("="*70)

print(f"\nData source:")
print(f"  Database: {DB_PATH}")
print(f"  Table: Organism")

print(f"\nOutput file:")
print(f"  {OUTPUT_FILE}")

print(f"\nMetadata summary:")
print(f"  Total organisms: {len(organisms)}")
print(f"  Unique divisions: {organisms['division'].nunique()}")
print(f"  Unique genera: {organisms['genus'].nunique()}")

print(f"\nDivision breakdown:")
division_counts = organisms['division'].value_counts()
for division, count in division_counts.items():
    print(f"  {division}: {count} organisms")

print(f"\nUsage:")
print(f"  - Cross-reference orgId ↔ Species_Name ↔ NCBI_TaxID")
print(f"  - Map between different naming conventions")
print(f"  - Look up taxonomic information")

print("\n" + "="*70)
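
The cross-referencing described above could look like the following sketch. The `measurements` table and the `demoOrg` row are hypothetical, included only to show a join on `orgId` and dictionary lookups in both directions; `NCBI_TaxID` could be mapped the same way.

```python
import pandas as pd

metadata = pd.DataFrame({
    "orgId": ["Keio", "demoOrg"],   # "demoOrg" is invented
    "Species_Name": ["Escherichia coli BW25113", "Examplea testensis X1"],
})

# Hypothetical downstream table keyed by orgId
measurements = pd.DataFrame({
    "orgId": ["Keio", "Keio", "demoOrg"],
    "value": [0.1, 0.2, 0.3],
})

# validate="many_to_one" raises MergeError if orgId is duplicated in metadata
annotated = measurements.merge(
    metadata, on="orgId", how="left", validate="many_to_one"
)

# Dictionary lookups in both directions
org_to_species = metadata.set_index("orgId")["Species_Name"].to_dict()
species_to_org = {name: org for org, name in org_to_species.items()}
print(org_to_species["Keio"])
```

In practice `metadata` would come from `pd.read_csv("results/organism_metadata.csv")` rather than being built inline.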