# Extract Organism Metadata from Fitness Browser Database

**Purpose**: Create a comprehensive organism metadata table for cross-referencing organism identifiers, species names, and NCBI taxonomy IDs

**Data Source**: Fitness Browser SQLite database (feba.db)
- Database: https://figshare.com/articles/dataset/25236931
- Contains metadata for 57 organisms

**Output File**:
- `results/organism_metadata.csv` - Complete organism reference table

**Columns in output**:
- `orgId` - Short identifier (e.g., "Keio", "Cola")
- `Full_Species_Name` - With division prefix (e.g., "Gammaproteobacteria: Escherichia coli BW25113")
- `NCBI_TaxID` - NCBI taxonomy identifier
- `division` - Taxonomic division
- `genus` - Genus name
- `species` - Species name
- `strain` - Strain identifier
- `Species_Name` - Simple format (e.g., "Escherichia coli BW25113")

**Last updated**: 2025-10-06

## Setup

In [1]:
import sqlite3
import pandas as pd
from pathlib import Path

print("Imports successful")

Imports successful


## Configuration

In [2]:
# Paths
DB_PATH = Path("../data/source/feba.db")
OUTPUT_DIR = Path("results")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

OUTPUT_FILE = OUTPUT_DIR / "organism_metadata.csv"

print("Configuration set")
print(f"  Database: {DB_PATH}")
print(f"  Output: {OUTPUT_FILE}")

# Stop early if the database is missing: sqlite3.connect() would otherwise
# silently create an empty file at DB_PATH and the query would fail later.
if not DB_PATH.exists():
    raise FileNotFoundError(
        f"Database not found at {DB_PATH}; place feba.db in {DB_PATH.parent}/"
    )
print(f"\nDatabase found: {DB_PATH.stat().st_size / 1e9:.1f} GB")

Configuration set
  Database: ../data/source/feba.db
  Output: results/organism_metadata.csv

Database found: 8.6 GB


## Connect to Database

In [3]:
print(f"Connecting to database: {DB_PATH}")
conn = sqlite3.connect(str(DB_PATH))
print("Connected successfully")

Connecting to database: ../data/source/feba.db
Connected successfully
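
Because this notebook only reads from `feba.db`, the connection could also be opened read-only. The sketch below uses SQLite's URI syntax with `mode=ro`, which refuses writes and raises an error for a missing file instead of creating an empty database. The helper name `open_readonly` is hypothetical and not part of the original notebook.

```python
import sqlite3
from pathlib import Path


def open_readonly(db_path: Path) -> sqlite3.Connection:
    """Open an existing SQLite file read-only (hypothetical helper)."""
    # as_uri() requires an absolute path, hence resolve().
    # mode=ro makes SQLite refuse to create or modify the file.
    uri = f"{db_path.resolve().as_uri()}?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

Under this assumption, `conn = open_readonly(DB_PATH)` would take the place of the plain `sqlite3.connect(str(DB_PATH))` call.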


## Extract Organism Metadata

In [4]:
# Query organism table
query = """
SELECT 
    orgId,
    division,
    genus,
    species,
    strain,
    CAST(taxonomyId AS INTEGER) as NCBI_TaxID
FROM Organism
ORDER BY orgId
"""

print("Extracting organism metadata...")
organisms = pd.read_sql_query(query, conn)

print(f"\nExtracted {len(organisms)} organisms")
print("\nFirst 5 organisms:")
print(organisms.head())

Extracting organism metadata...

Extracted 57 organisms

First 5 organisms:
    orgId             division            genus           species  \
0    ANA3  Gammaproteobacteria       Shewanella               sp.   
1   BFirm   Betaproteobacteria     Burkholderia      phytofirmans   
2  Bifido       Actinobacteria  Bifidobacterium             breve   
3   Brev2  Alphaproteobacteria    Brevundimonas               sp.   
4  Btheta        Bacteroidetes      Bacteroides  thetaiotaomicron   

               strain  NCBI_TaxID  
0               ANA-3       94122  
1                PsJN      398527  
2             UCC2003      326426  
3  GW460-12-10-14-LB2       41275  
4            VPI-5482      226186  
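
The query above assumes that the `Organism` table exposes `orgId`, `division`, `genus`, `species`, `strain`, and `taxonomyId`. A quick check with `PRAGMA table_info` can confirm this before querying a different release of the database. This is a minimal sketch: the helper `missing_columns` is hypothetical, and the in-memory table only stands in for the real schema, which may differ.

```python
import sqlite3

EXPECTED = {"orgId", "division", "genus", "species", "strain", "taxonomyId"}


def missing_columns(conn: sqlite3.Connection, table: str = "Organism") -> set:
    """Return the expected columns that the table lacks (hypothetical helper)."""
    # PRAGMA table_info yields one row per column; index 1 holds the column name.
    # PRAGMA statements cannot be parameterized, so `table` must be a trusted name.
    present = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    return EXPECTED - present


# Stand-in database for illustration only; it is not a copy of feba.db.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE Organism (orgId TEXT, division TEXT, genus TEXT, "
    "species TEXT, strain TEXT, taxonomyId TEXT)"
)
print(missing_columns(demo))  # an empty set means every expected column exists
```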


## Create Species Name Formats

In [5]:
# Create Full_Species_Name (with division prefix)
organisms['Full_Species_Name'] = (
    organisms['division'] + ': ' +
    organisms['genus'] + ' ' +
    organisms['species'] + ' ' +
    organisms['strain']
).str.strip()

# Create Species_Name (simple format, no division)
organisms['Species_Name'] = (
    organisms['genus'] + ' ' +
    organisms['species'] + ' ' +
    organisms['strain']
).str.strip()

print("Created species name formats")
print(f"\nExample formats for {organisms.iloc[0]['orgId']}:")
print(f"  orgId: {organisms.iloc[0]['orgId']}")
print(f"  Full_Species_Name: {organisms.iloc[0]['Full_Species_Name']}")
print(f"  Species_Name: {organisms.iloc[0]['Species_Name']}")
print(f"  NCBI_TaxID: {organisms.iloc[0]['NCBI_TaxID']}")

Created species name formats

Example formats for ANA3:
  orgId: ANA3
  Full_Species_Name: Gammaproteobacteria: Shewanella sp. ANA-3
  Species_Name: Shewanella sp. ANA-3
  NCBI_TaxID: 94122
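
One caveat with the `+` concatenation above: in pandas, a missing value in any part makes the whole concatenated name missing, and `.str.strip()` does not repair that. If a future database release contained a row without a strain, a null-safe join might look like the sketch below. The helper `join_name` and the second demo row are hypothetical; only the first row comes from the output above.

```python
import pandas as pd


def join_name(parts) -> str:
    """Join the non-missing, non-empty name parts with single spaces (hypothetical helper)."""
    cleaned = (str(p).strip() for p in parts if pd.notna(p))
    return " ".join(p for p in cleaned if p)


demo = pd.DataFrame(
    {
        "genus": ["Shewanella", "Examplea"],       # second row is made up
        "species": ["sp.", "hypothetica"],
        "strain": ["ANA-3", None],                 # missing strain on purpose
    }
)

# apply(axis=1) passes each row as a Series; iterating it yields the cell values.
demo["Species_Name"] = demo[["genus", "species", "strain"]].apply(join_name, axis=1)
print(demo["Species_Name"].tolist())
```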


## Reorder Columns

In [6]:
# Reorder columns for better readability
column_order = [
    'orgId',
    'Full_Species_Name',
    'NCBI_TaxID',
    'division',
    'genus',
    'species',
    'strain',
    'Species_Name'
]

organisms = organisms[column_order]

print("Reordered columns:")
for col in organisms.columns:
    print(f"  - {col}")

Reordered columns:
  - orgId
  - Full_Species_Name
  - NCBI_TaxID
  - division
  - genus
  - species
  - strain
  - Species_Name


## Display All Organisms

In [7]:
print("All organisms in Fitness Browser database:")
print("="*70)

for _, row in organisms.iterrows():
    print(f"{row['orgId']:30s} {row['Species_Name']}")

print("="*70)
print(f"Total: {len(organisms)} organisms")

All organisms in Fitness Browser database:
ANA3                           Shewanella sp. ANA-3
BFirm                          Burkholderia phytofirmans PsJN
Bifido                         Bifidobacterium breve UCC2003
Brev2                          Brevundimonas sp. GW460-12-10-14-LB2
Btheta                         Bacteroides thetaiotaomicron VPI-5482
Burk376                        Paraburkholderia bryophila 376MFSha3.1
Burkholderia_OAS925            Paraburkholderia graminis OAS925
Bvulgatus_CL09T03C04           Phocaeicola vulgatus CL09T03C04
CL21                           Ralstonia sp. UNC404CL21Col
Caulo                          Caulobacter crescentus NA1000
Cola                           Echinicola vietnamensis KMM 6221, DSM 17526
Cup4G11                        Cupriavidus basilensis FW507-4G11
Dda3937                        Dickeya dadantii 3937
Ddia6719                       Dickeya dianthicola 67-19
DdiaME23                       Dickeya dianthicola ME23
Dino                  

## Save Metadata Table

In [8]:
print(f"\nSaving organism metadata to: {OUTPUT_FILE}")
organisms.to_csv(OUTPUT_FILE, index=False)
print(f"Saved {len(organisms)} organisms with {len(organisms.columns)} columns")

print("\nColumns saved:")
for col in organisms.columns:
    print(f"  - {col}")


Saving organism metadata to: results/organism_metadata.csv
Saved 57 organisms with 8 columns

Columns saved:
  - orgId
  - Full_Species_Name
  - NCBI_TaxID
  - division
  - genus
  - species
  - strain
  - Species_Name
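
Some names in this table contain commas, for example the strain of `Cola` ("KMM 6221, DSM 17526"). `DataFrame.to_csv` quotes such fields by default, so they survive a round trip as long as the file is read with a real CSV parser rather than split on commas. The sketch below illustrates the same quoting behaviour with the standard-library `csv` module; it uses two rows taken from the listing above and is not part of the original notebook.

```python
import csv
import io

rows = [
    {"orgId": "Cola", "Species_Name": "Echinicola vietnamensis KMM 6221, DSM 17526"},
    {"orgId": "Caulo", "Species_Name": "Caulobacter crescentus NA1000"},
]

# Write to an in-memory buffer; fields containing a comma are quoted automatically.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["orgId", "Species_Name"])
writer.writeheader()
writer.writerows(rows)

# Read the text back and confirm the comma-containing name is intact.
buffer.seek(0)
round_tripped = [dict(row) for row in csv.DictReader(buffer)]
print(round_tripped == rows)
```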


## Close Database Connection

In [9]:
conn.close()
print("Database connection closed")

Database connection closed


## Summary

In [10]:
print("="*70)
print("ORGANISM METADATA EXTRACTION COMPLETE")
print("="*70)

print("\nData source:")
print(f"  Database: {DB_PATH}")
print("  Table: Organism")

print("\nOutput file:")
print(f"  {OUTPUT_FILE}")

print("\nMetadata summary:")
print(f"  Total organisms: {len(organisms)}")
print(f"  Unique divisions: {organisms['division'].nunique()}")
print(f"  Unique genera: {organisms['genus'].nunique()}")

print("\nDivision breakdown:")
division_counts = organisms['division'].value_counts()
for division, count in division_counts.items():
    print(f"  {division}: {count} organisms")

print("\nUsage:")
print("  - Cross-reference orgId ↔ Species_Name ↔ NCBI_TaxID")
print("  - Map between different naming conventions")
print("  - Look up taxonomic information")

print("\n" + "="*70)

ORGANISM METADATA EXTRACTION COMPLETE

Data source:
  Database: ../data/source/feba.db
  Table: Organism

Output file:
  results/organism_metadata.csv

Metadata summary:
  Total organisms: 57
  Unique divisions: 8
  Unique genera: 36

Division breakdown:
  Gammaproteobacteria: 25 organisms
  Betaproteobacteria: 13 organisms
  Alphaproteobacteria: 8 organisms
  Bacteroidetes: 5 organisms
  Deltaproteobacteria: 2 organisms
  Euryarchaeota: 2 organisms
  Actinobacteria: 1 organisms
  Cyanobacteria: 1 organisms

Usage:
  - Cross-reference orgId ↔ Species_Name ↔ NCBI_TaxID
  - Map between different naming conventions
  - Look up taxonomic information
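
As an illustration of the cross-referencing use case, the sketch below joins the metadata onto another table keyed by `orgId` and looks up an `orgId` from an NCBI TaxID. It reads a two-row inline sample (values copied from the output above) instead of `results/organism_metadata.csv`, and the `hits` table with its `n_genes` column is entirely hypothetical.

```python
import io

import pandas as pd

# Inline stand-in for results/organism_metadata.csv (subset of columns and rows).
sample_csv = io.StringIO(
    "orgId,Species_Name,NCBI_TaxID\n"
    "ANA3,Shewanella sp. ANA-3,94122\n"
    "BFirm,Burkholderia phytofirmans PsJN,398527\n"
)
metadata = pd.read_csv(sample_csv)

# Hypothetical downstream table that only knows the short orgId.
hits = pd.DataFrame({"orgId": ["BFirm", "ANA3", "BFirm"], "n_genes": [3, 5, 1]})

# validate="many_to_one" raises if orgId is not unique in the metadata table.
annotated = hits.merge(metadata, on="orgId", how="left", validate="many_to_one")

# Reverse lookup: NCBI TaxID -> orgId.
by_taxid = metadata.set_index("NCBI_TaxID")["orgId"]
print(annotated)
print(by_taxid.loc[94122])
```

A left join keeps every row of the downstream table, so an `orgId` absent from the metadata would show up as missing values rather than being silently dropped.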

