# Web of Microbes (WoM) Data Ingestion into BERDL

This notebook loads the Web of Microbes exometabolomics database (Kosina et al. 2018,
BMC Microbiology) into BERDL as `kescience_webofmicrobes`.

**Data source**: SQLite database from webofmicrobes.org (archived 2018 snapshot)

**Tables**:
- `compound` — 589 metabolites (amino acids, nucleotides, sugars, unknowns)
- `environment` — 10 growth media/conditions
- `organism` — 37 organisms (incl. ENIGMA groundwater isolates)
- `project` — 5 published studies
- `observation` — 10,744 metabolite uptake/release assertions

**Pipeline**: TSV files → MinIO bronze → Delta Lake silver via `data_lakehouse_ingest`

## 1. Initialize Spark and MinIO clients

In [None]:
spark = get_spark_session()
minio_client = get_minio_client()

## 2. Upload TSV files and config to MinIO bronze layer

Upload the pre-exported TSV files and ingestion config JSON to the
bronze storage path on MinIO.

In [None]:
import os

LOCAL_DIR = "/home/psdehal/pangenome_science/BERIL-research-observatory/data/wom_ingest"
BUCKET = "cdm-lake"
BRONZE_PREFIX = "tenant-general-warehouse/kescience/datasets/webofmicrobes"

files_to_upload = [
    "compound.tsv",
    "environment.tsv",
    "organism.tsv",
    "project.tsv",
    "observation.tsv",
    "webofmicrobes.json",
]

for fname in files_to_upload:
    local_path = os.path.join(LOCAL_DIR, fname)
    remote_key = f"{BRONZE_PREFIX}/{fname}"
    fsize = os.path.getsize(local_path)

    # fput_object uploads the local file to BUCKET under the key remote_key
    minio_client.fput_object(BUCKET, remote_key, local_path)
    print(f"  Uploaded {fname} ({fsize:,} bytes) → s3a://{BUCKET}/{remote_key}")

print("\nAll files uploaded to bronze layer.")
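The upload loop above stops with an exception at the first missing file, possibly after some files have already been uploaded. A pre-flight check along the following lines could report all missing or empty files before anything is sent. This is a minimal sketch: `find_missing_files` is a hypothetical helper, not part of the pipeline, and the demo runs against a temporary directory rather than `LOCAL_DIR`.

```python
import os
import tempfile


def find_missing_files(local_dir, filenames):
    """Return the names in `filenames` that are absent or empty under `local_dir`."""
    missing = []
    for name in filenames:
        path = os.path.join(local_dir, name)
        # Treat zero-byte files as missing: an empty TSV would ingest as an empty table.
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing


# Demo against a throwaway directory holding only one of the two expected files.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "compound.tsv"), "w") as fh:
        fh.write("id\tcompound_name\n")
    demo_missing = find_missing_files(demo_dir, ["compound.tsv", "organism.tsv"])

print(demo_missing)
```

In the notebook, the equivalent call would be `find_missing_files(LOCAL_DIR, files_to_upload)`, run before the loop; an empty list means every file is present.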

## 3. Verify uploads

In [None]:
objects = minio_client.list_objects(BUCKET, prefix=BRONZE_PREFIX, recursive=True)
print(f"Objects in s3a://{BUCKET}/{BRONZE_PREFIX}/:\n")
for obj in objects:
    print(f"  {obj.object_name}  ({obj.size:,} bytes)")
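The listing above is checked by eye. A small comparison like the one below could turn it into an explicit check that every expected key is present. It is a sketch under assumptions: `missing_remote_keys` is a hypothetical helper, and the demo uses stand-in object names and a placeholder prefix instead of a live MinIO listing.

```python
def missing_remote_keys(expected_files, listed_names, prefix):
    """Return the expected object keys (prefix + file name) that are not in `listed_names`."""
    expected = {f"{prefix}/{name}" for name in expected_files}
    return sorted(expected - set(listed_names))


# Stand-in listing (not real bucket contents): only one of two files is present.
demo_prefix = "demo-prefix"
demo_listed = [f"{demo_prefix}/compound.tsv"]
demo_gaps = missing_remote_keys(["compound.tsv", "observation.tsv"], demo_listed, demo_prefix)
print(demo_gaps)
```

In the notebook, `listed_names` could be built as `[obj.object_name for obj in minio_client.list_objects(BUCKET, prefix=BRONZE_PREFIX, recursive=True)]` and compared against `files_to_upload` and `BRONZE_PREFIX`.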

## 4. Run ingestion pipeline

Load the TSV data from the bronze layer, apply the schema, write Delta tables to the silver layer,
and register them under the `kescience_webofmicrobes` namespace.

In [None]:
from data_lakehouse_ingest import ingest

cfg_path = f"s3a://{BUCKET}/{BRONZE_PREFIX}/webofmicrobes.json"
report = ingest(cfg_path)
report

## 5. Validate ingestion — sample queries

In [None]:
spark.sql("SHOW TABLES IN kescience_webofmicrobes").show()
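Beyond confirming that the tables exist, their row counts can be compared with the counts listed at the top of this notebook. The sketch below assumes those header counts are the expected values; `count_mismatches` is a hypothetical helper, and the demo uses stand-in numbers rather than real query results.

```python
# Expected row counts, taken from the table list at the top of this notebook.
EXPECTED_ROWS = {
    "compound": 589,
    "environment": 10,
    "organism": 37,
    "project": 5,
    "observation": 10744,
}


def count_mismatches(expected, actual):
    """Return {table: (expected, actual)} for tables whose counts differ or are missing."""
    return {
        table: (n_expected, actual.get(table))
        for table, n_expected in expected.items()
        if actual.get(table) != n_expected
    }


# In the notebook, `actual` could be built from the live tables, for example:
#   actual = {t: spark.table(f"kescience_webofmicrobes.{t}").count() for t in EXPECTED_ROWS}
# Stand-in counts (not real results) to show the shape of the report:
demo_actual = dict(EXPECTED_ROWS, observation=10000)
demo_report = count_mismatches(EXPECTED_ROWS, demo_actual)
print(demo_report)
```

An empty dictionary means every table matches its expected count.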

In [None]:
# List every WoM organism with its ID
spark.sql("""
    SELECT id, common_name
    FROM kescience_webofmicrobes.organism
    ORDER BY id
""").show(50, truncate=False)

In [None]:
# Summary: observations by action type
spark.sql("""
    SELECT 
        action,
        CASE action
            WHEN 'D' THEN 'Decreased'
            WHEN 'I' THEN 'Increased'
            WHEN 'N' THEN 'No change'
            WHEN 'E' THEN 'Excreted/Exported'
            ELSE 'Unknown'
        END as description,
        COUNT(*) as n_observations
    FROM kescience_webofmicrobes.observation
    GROUP BY action
    ORDER BY n_observations DESC
""").show()

In [None]:
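# Check which WoM organisms might overlap with Fitness Browser.
# Matching is a substring test: the Fitness Browser strain must appear in the WoM common name.
# Empty or NULL strains are excluded in the ON clause; an empty strain would otherwise
# match every WoM organism ('%%' matches any string).
spark.sql("""
    SELECT
        w.common_name as wom_organism,
        fb.orgId as fb_orgId,
        fb.genus as fb_genus,
        fb.species as fb_species,
        fb.strain as fb_strain
    FROM kescience_webofmicrobes.organism w
    LEFT JOIN kescience_fitnessbrowser.organism fb
        ON fb.strain IS NOT NULL
        AND fb.strain <> ''
        AND w.common_name LIKE CONCAT('%', fb.strain, '%')
    WHERE w.common_name LIKE '%FW%' OR w.common_name LIKE '%GW%'
    ORDER BY w.common_name
""").show(20, truncate=False)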
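The matching rule in the join above can be sketched in plain Python to make its behavior easier to reason about. This only approximates the SQL: `LIKE` also treats `_` and `%` inside a strain as wildcards, which a Python substring test does not. `match_strains` is a hypothetical helper, and the names below are made-up placeholders, not entries from either database.

```python
def match_strains(common_names, strains):
    """Pair each name with every non-empty strain it contains; unmatched names pair with None."""
    pairs = []
    for name in common_names:
        # Skip None and empty strains: an empty string is a substring of every name.
        hits = [s for s in strains if s and s in name]
        if hits:
            pairs.extend((name, s) for s in hits)
        else:
            # Mirror the LEFT JOIN: keep the name even when nothing matches.
            pairs.append((name, None))
    return pairs


# Placeholder inputs (not real organism or strain names).
demo_names = ["Example sp. FW000-X1", "Example sp. GW000-Y2"]
demo_strains = ["FW000-X1", "", None]
demo_pairs = match_strains(demo_names, demo_strains)
print(demo_pairs)
```

Because the test is a substring match, a short strain label can match several unrelated names, so matches from the query are best treated as candidates to confirm by hand.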

In [None]:
# Top 20 metabolites by number of distinct organisms that decrease them (action = 'D', read here as consumption)
spark.sql("""
    SELECT 
        c.compound_name,
        COUNT(DISTINCT obs.organism_id) as n_organisms_decrease,
        c.formula
    FROM kescience_webofmicrobes.observation obs
    JOIN kescience_webofmicrobes.compound c ON obs.compound_id = c.id
    WHERE obs.action = 'D'
    GROUP BY c.compound_name, c.formula
    ORDER BY n_organisms_decrease DESC
    LIMIT 20
""").show(20, truncate=False)

In [None]:
spark.stop()