# NB01: PHB Gene Discovery and Data Exploration

**Purpose**: Identify all polyhydroxybutyrate (PHB) pathway gene clusters across the pangenome (27K species, 293K genomes) and explore the NMDC schema for metagenomic analysis.

**Requires**: BERDL JupyterHub (Spark session)

**Outputs**:
- `data/phb_gene_clusters.tsv` — all PHB pathway gene clusters with species, KEGG KO, core/aux status
- `data/phb_species_summary.tsv` — per-species PHB pathway completeness
- NMDC schema exploration results (inline)

## PHB Pathway Markers

| Gene | Function | KEGG KO | Specificity |
|------|----------|---------|-------------|
| phaC | PHA synthase (committed step) | K03821 | **PHB-specific** |
| phaP | Phasin (granule protein) | K14205 | PHB-specific |
| phaR | PHB transcriptional regulator | K18080 | PHB-specific |
| phaZ | PHB depolymerase | K05973 | PHB-specific |
| phaA | Beta-ketothiolase | K00626 | Shared with fatty acid metabolism |
| phaB | Acetoacetyl-CoA reductase | K00023 | Shared with SDR family |

In [1]:
# Initialize Spark session
# On BERDL JupyterHub — no import needed (injected into kernel)
spark = get_spark_session()
print(f'Spark session active: {spark.version}')

Spark session active: 4.0.1


In [2]:
import os
import pandas as pd
import numpy as np

# Project paths
PROJECT_DIR = os.path.expanduser('~/BERIL-research-observatory/projects/phb_granule_ecology')
DATA_DIR = os.path.join(PROJECT_DIR, 'data')
FIG_DIR = os.path.join(PROJECT_DIR, 'figures')
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(FIG_DIR, exist_ok=True)
print(f'Project dir: {PROJECT_DIR}')

Project dir: /home/aparkin/BERIL-research-observatory/projects/phb_granule_ecology


## Part 1: PHB Gene Discovery in the Pangenome

Search `eggnog_mapper_annotations` for PHB pathway genes using KEGG KO identifiers.

**Important**: The `KEGG_ko` column can contain multiple comma-separated KOs, so we use `LIKE` patterns rather than exact matches.
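Once the rows are in Python, the comma-separated values can also be matched token by token instead of by substring. The sketch below assumes values look like `ko:K03821,ko:K00626`, with `-` or a null for missing annotations (the forms visible in this notebook's outputs). `parse_kos` is a hypothetical helper, not part of the notebook, and it needs Python 3.9+ for `str.removeprefix`.

```python
PHB_KO_IDS = {"K03821", "K00023", "K00626", "K05973", "K14205", "K18080"}

def parse_kos(kegg_ko):
    """Split a comma-separated KEGG_ko value into a set of bare KO IDs."""
    # Nulls (None, NaN) and the '-' placeholder carry no KO assignment
    if not isinstance(kegg_ko, str) or kegg_ko.strip() in ("", "-"):
        return set()
    return {
        tok.strip().removeprefix("ko:")
        for tok in kegg_ko.split(",")
        if tok.strip()
    }

# Toy values for illustration only (not rows from the pangenome tables)
examples = ["ko:K03821", "ko:K00626,ko:K07508", "-", None]
for value in examples:
    hits = sorted(parse_kos(value) & PHB_KO_IDS)
    print(f"{value!r}: {hits}")
```

Exact token matching avoids accidental substring hits, at the cost of having to pull candidate rows out of Spark first.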

In [3]:
# Define PHB pathway KEGG KOs
PHB_KOS = {
    'K03821': 'phaC - PHA synthase (committed step)',
    'K00023': 'phaB - acetoacetyl-CoA reductase',
    'K00626': 'phaA - beta-ketothiolase',
    'K05973': 'phaZ - PHB depolymerase',
    'K14205': 'phaP - phasin (granule protein)',
    'K18080': 'phaR - PHB transcriptional regulator',
}

# Also search by description keywords as a cross-check
PHB_KEYWORDS = [
    'polyhydroxybutyrate', 'polyhydroxyalkanoate', 'phasin',
    'pha synthase', 'phb synthase', 'poly-beta-hydroxybutyrate',
    'acetoacetyl-coa reductase',
]

print('PHB pathway markers defined')
for ko, desc in PHB_KOS.items():
    print(f'  {ko}: {desc}')

PHB pathway markers defined
  K03821: phaC - PHA synthase (committed step)
  K00023: phaB - acetoacetyl-CoA reductase
  K00626: phaA - beta-ketothiolase
  K05973: phaZ - PHB depolymerase
  K14205: phaP - phasin (granule protein)
  K18080: phaR - PHB transcriptional regulator


In [4]:
# Query 1: Find all gene clusters with PHB-related KEGG KOs
# Join eggnog_mapper_annotations with gene_cluster to get species and core/aux status

phb_clusters_df = spark.sql("""
    SELECT gc.gtdb_species_clade_id,
           gc.gene_cluster_id,
           gc.is_core,
           gc.is_auxiliary,
           gc.is_singleton,
           ann.KEGG_ko,
           ann.COG_category,
           ann.EC,
           ann.PFAMs,
           ann.Description
    FROM kbase_ke_pangenome.gene_cluster gc
    JOIN kbase_ke_pangenome.eggnog_mapper_annotations ann
        ON gc.gene_cluster_id = ann.query_name
    WHERE ann.KEGG_ko LIKE '%K03821%'
       OR ann.KEGG_ko LIKE '%K00023%'
       OR ann.KEGG_ko LIKE '%K00626%'
       OR ann.KEGG_ko LIKE '%K05973%'
       OR ann.KEGG_ko LIKE '%K14205%'
       OR ann.KEGG_ko LIKE '%K18080%'
""")

# Cache for reuse
phb_clusters_df.cache()
n_clusters = phb_clusters_df.count()
print(f'Found {n_clusters:,} PHB-related gene clusters across all species')

Found 118,513 PHB-related gene clusters across all species
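The six hard-coded `LIKE` conditions in Query 1 could instead be generated from the KO list, in the same way Query 2 builds its keyword conditions. This is a minimal sketch: `build_ko_filter` is a hypothetical helper, and the column name `ann.KEGG_ko` is taken from the query above.

```python
import re

PHB_KO_IDS = ["K03821", "K00023", "K00626", "K05973", "K14205", "K18080"]

def build_ko_filter(ko_ids, column="ann.KEGG_ko"):
    """Return an OR-joined SQL condition with one LIKE pattern per KO ID."""
    for ko in ko_ids:
        # Reject malformed IDs before interpolating them into SQL text
        if not re.fullmatch(r"K\d{5}", ko):
            raise ValueError(f"Not a KEGG KO identifier: {ko!r}")
    return " OR ".join(f"{column} LIKE '%{ko}%'" for ko in ko_ids)

where_clause = build_ko_filter(PHB_KO_IDS)
print(where_clause)
```

The resulting string can be interpolated into the `WHERE` clause with an f-string, which keeps the query in sync with the KO list if markers are added or removed.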


In [5]:
# Convert to pandas for analysis (should be manageable — one row per gene cluster per species)
phb_pd = phb_clusters_df.toPandas()
print(f'Shape: {phb_pd.shape}')
print(f'\nUnique species: {phb_pd["gtdb_species_clade_id"].nunique():,}')
print(f'\nKEGG KO distribution:')

# Parse which PHB KOs are present in each row
for ko, desc in PHB_KOS.items():
    mask = phb_pd['KEGG_ko'].str.contains(ko, na=False)
    n = mask.sum()
    n_species = phb_pd.loc[mask, 'gtdb_species_clade_id'].nunique()
    print(f'  {ko} ({desc.split(" - ")[0]}): {n:,} clusters in {n_species:,} species')

Shape: (118513, 10)

Unique species: 19,496

KEGG KO distribution:
  K03821 (phaC): 11,792 clusters in 6,067 species
  K00023 (phaB): 9,617 clusters in 6,977 species
  K00626 (phaA): 86,318 clusters in 17,969 species
  K05973 (phaZ): 4,656 clusters in 3,151 species
  K14205 (phaP): 6,130 clusters in 4,571 species
  K18080 (phaR): 0 clusters in 0 species


In [6]:
# Query 2: Cross-check with description-based search
# This catches annotations that might have PHB function but different KO assignments

desc_conditions = ' OR '.join(
    [f"LOWER(ann.Description) LIKE '%{kw}%'" for kw in PHB_KEYWORDS]
)

phb_desc_df = spark.sql(f"""
    SELECT ann.Description, ann.KEGG_ko, ann.COG_category, ann.PFAMs,
           COUNT(*) as n_clusters,
           COUNT(DISTINCT gc.gtdb_species_clade_id) as n_species
    FROM kbase_ke_pangenome.gene_cluster gc
    JOIN kbase_ke_pangenome.eggnog_mapper_annotations ann
        ON gc.gene_cluster_id = ann.query_name
    WHERE {desc_conditions}
    GROUP BY ann.Description, ann.KEGG_ko, ann.COG_category, ann.PFAMs
    ORDER BY n_clusters DESC
""")

phb_desc_pd = phb_desc_df.toPandas()
print(f'Description-based search found {len(phb_desc_pd)} distinct annotation groups')
print(f'Total clusters: {phb_desc_pd["n_clusters"].sum():,}')
print(f'\nTop annotations:')
phb_desc_pd.head(20)

Description-based search found 56 distinct annotation groups
Total clusters: 17,324

Top annotations:


| | Description | KEGG_ko | COG_category | PFAMs | n_clusters | n_species |
|---|-------------|---------|--------------|-------|------------|-----------|
| 0 | Phasin protein | - | S | Phasin_2 | 5484 | 2793 |
| 1 | Poly-beta-hydroxybutyrate polymerase (PhaC) N-... | ko:K03821 | I | Abhydrolase_1,PhaC_N | 1441 | 1185 |
| 2 | TIGRFAM phasin family protein | - | S | Phasin_2 | 1142 | 753 |
| 3 | Acetoacetyl-CoA reductase | ko:K00023 | IQ | adh_short_C2 | 1108 | 1066 |
| 4 | Polyhydroxyalkanoate synthesis repressor PhaR | - | S | PHB_acc,PHB_acc_N | 814 | 798 |
| 5 | Poly-beta-hydroxybutyrate polymerase (PhaC) N-... | ko:K03821 | I | PhaC_N | 768 | 668 |
| 6 | Poly(hydroxyalcanoate) granule associated prot... | - | S | Phasin | 737 | 673 |
| 7 | Acetoacetyl-CoA reductase | ko:K00023 | IQ | adh_short,adh_short_C2 | 718 | 544 |
| 8 | Poly-beta-hydroxybutyrate polymerase N terminal | ko:K03821 | I | PHBC_N,PhaC_N | 556 | 450 |
| 9 | Polyhydroxyalkanoate synthesis repressor | - | S | PHB_acc,PHB_acc_N | 448 | 443 |


In [7]:
# Assign each gene cluster a PHB gene label based on which KO it matches
# Priority: phaC > phaZ > phaP > phaR > phaB > phaA (PHB-specific markers first, shared enzymes last)

def assign_phb_gene(kegg_ko):
    """Assign PHB gene name based on KEGG KO content."""
    if pd.isna(kegg_ko):
        return 'unknown'
    ko = str(kegg_ko)
    if 'K03821' in ko: return 'phaC'
    if 'K05973' in ko: return 'phaZ'
    if 'K14205' in ko: return 'phaP'
    if 'K18080' in ko: return 'phaR'
    if 'K00023' in ko: return 'phaB'
    if 'K00626' in ko: return 'phaA'
    return 'unknown'

phb_pd['phb_gene'] = phb_pd['KEGG_ko'].apply(assign_phb_gene)
print('PHB gene assignments:')
print(phb_pd['phb_gene'].value_counts())

PHB gene assignments:
phb_gene
phaA    86318
phaC    11792
phaB     9617
phaP     6130
phaZ     4656
Name: count, dtype: int64


In [8]:
# Core/Accessory/Singleton status by PHB gene
status_summary = phb_pd.groupby('phb_gene').agg(
    n_clusters=('gene_cluster_id', 'count'),
    n_species=('gtdb_species_clade_id', 'nunique'),
    pct_core=('is_core', lambda x: (x == 1).mean() * 100),
    pct_aux=('is_auxiliary', lambda x: (x == 1).mean() * 100),
    pct_singleton=('is_singleton', lambda x: (x == 1).mean() * 100),
).round(1)

print('PHB gene clusters — core/accessory/singleton status:')
status_summary

PHB gene clusters — core/accessory/singleton status:


| phb_gene | n_clusters | n_species | pct_core | pct_aux | pct_singleton |
|----------|------------|-----------|----------|---------|---------------|
| phaA | 86318 | 17969 | 62.2 | 37.8 | 26.3 |
| phaB | 9617 | 6977 | 68.0 | 32.0 | 21.0 |
| phaC | 11792 | 6067 | 70.7 | 29.3 | 19.4 |
| phaP | 6130 | 4571 | 73.3 | 26.7 | 16.7 |
| phaZ | 4656 | 3151 | 80.4 | 19.6 | 12.8 |
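In every row of this summary, `pct_core` and `pct_aux` sum to 100 and `pct_singleton` is smaller than `pct_aux`. That pattern is consistent with singletons being a subset of auxiliary clusters, although the notebook does not confirm this from the schema, so treat it as an assumption. The sketch below re-checks the arithmetic on the reported percentages and derives the non-singleton accessory share; `non_singleton_aux` is a hypothetical helper.

```python
# (pct_core, pct_aux, pct_singleton) copied from the summary table above
status = {
    "phaA": (62.2, 37.8, 26.3),
    "phaB": (68.0, 32.0, 21.0),
    "phaC": (70.7, 29.3, 19.4),
    "phaP": (73.3, 26.7, 16.7),
    "phaZ": (80.4, 19.6, 12.8),
}

def non_singleton_aux(pct_aux, pct_singleton):
    """Share of clusters that are auxiliary but not singletons (assumes singleton is a subset of aux)."""
    return round(pct_aux - pct_singleton, 1)

for gene, (core, aux, single) in status.items():
    # Core and auxiliary should partition the clusters (allowing for rounding)
    assert abs(core + aux - 100.0) < 0.15, gene
    assert single <= aux, gene
    print(f"{gene}: non-singleton accessory = {non_singleton_aux(aux, single)}%")
```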


In [9]:
# Save gene cluster data
out_path = os.path.join(DATA_DIR, 'phb_gene_clusters.tsv')
phb_pd.to_csv(out_path, sep='\t', index=False)
print(f'Saved {len(phb_pd):,} rows to {out_path}')

Saved 118,513 rows to /home/aparkin/BERIL-research-observatory/projects/phb_granule_ecology/data/phb_gene_clusters.tsv


## Part 2: Species-Level PHB Pathway Completeness

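Classify each species by PHB pathway status:
- **Complete**: has phaC + at least one of (phaA, phaB)
- **Synthase only**: has phaC but no phaA/phaB
- **Precursors only**: has phaA and/or phaB but no phaC
- **Accessory only**: has only phaP, phaR, and/or phaZ (no phaC, phaA, or phaB)
- **Absent**: no PHB pathway genes detected (species that do not appear in the KO query results)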

In [10]:
# Aggregate to species level: which PHB genes does each species have?
species_genes = phb_pd.groupby('gtdb_species_clade_id')['phb_gene'].apply(set).reset_index()
species_genes.columns = ['gtdb_species_clade_id', 'phb_genes_present']

def classify_phb_status(gene_set):
    has_phaC = 'phaC' in gene_set
    has_phaA = 'phaA' in gene_set
    has_phaB = 'phaB' in gene_set
    if has_phaC and (has_phaA or has_phaB):
        return 'complete'
    elif has_phaC:
        return 'synthase_only'
    elif has_phaA or has_phaB:
        return 'precursors_only'
    else:
        return 'accessory_only'  # only phaP/phaR/phaZ

species_genes['phb_status'] = species_genes['phb_genes_present'].apply(classify_phb_status)
species_genes['phb_genes_str'] = species_genes['phb_genes_present'].apply(lambda s: ','.join(sorted(s)))

print('Species PHB pathway status:')
print(species_genes['phb_status'].value_counts())
print(f'\nTotal species with any PHB gene: {len(species_genes):,}')

Species PHB pathway status:
phb_status
precursors_only    12871
complete            6005
accessory_only       558
synthase_only         62
Name: count, dtype: int64

Total species with any PHB gene: 19,496


In [11]:
# Get total species count to calculate PHB-absent species
total_species = spark.sql("""
    SELECT COUNT(*) as n FROM kbase_ke_pangenome.pangenome
""").collect()[0]['n']

n_phb_any = len(species_genes)
n_phb_absent = total_species - n_phb_any

print(f'Total species in pangenome: {total_species:,}')
print(f'Species with any PHB gene: {n_phb_any:,} ({n_phb_any/total_species*100:.1f}%)')
print(f'Species with no PHB genes: {n_phb_absent:,} ({n_phb_absent/total_species*100:.1f}%)')

Total species in pangenome: 27,702
Species with any PHB gene: 19,496 (70.4%)
Species with no PHB genes: 8,206 (29.6%)


In [12]:
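# Get phaC core/accessory status per species (is the synthase itself core or accessory?)
# 'max' flags a species if ANY of its phaC clusters is core (or auxiliary),
# so a species with several phaC clusters can be counted under both flags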
phac_status = phb_pd[phb_pd['phb_gene'] == 'phaC'].groupby('gtdb_species_clade_id').agg(
    n_phaC_clusters=('gene_cluster_id', 'count'),
    phaC_is_core=('is_core', 'max'),
    phaC_is_aux=('is_auxiliary', 'max'),
).reset_index()

# Merge with species-level summary
species_summary = species_genes.merge(phac_status, on='gtdb_species_clade_id', how='left')

print('\nAmong species with phaC:')
phac_species = species_summary[species_summary['phb_status'].isin(['complete', 'synthase_only'])]
print(f'  Total: {len(phac_species):,}')
print(f'  phaC is core: {(phac_species["phaC_is_core"] == 1).sum():,}')
print(f'  phaC is accessory: {(phac_species["phaC_is_aux"] == 1).sum():,}')


Among species with phaC:
  Total: 6,067
  phaC is core: 5,371
  phaC is accessory: 1,959
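The core and accessory counts add up to more than the 6,067 phaC species because a species with several phaC clusters can carry both flags. The overlap can be recovered by inclusion-exclusion, as in this sketch. It assumes every phaC species has at least one cluster flagged core or auxiliary; `split_core_aux` is a hypothetical helper.

```python
def split_core_aux(n_total, n_with_core, n_with_aux):
    """Split overlapping per-species flags into mutually exclusive groups."""
    # Inclusion-exclusion: |core AND aux| = |core| + |aux| - |core OR aux|
    n_both = n_with_core + n_with_aux - n_total
    return {
        "core_only": n_with_core - n_both,
        "aux_only": n_with_aux - n_both,
        "both": n_both,
    }

# Counts taken from the cell output above
breakdown = split_core_aux(n_total=6067, n_with_core=5371, n_with_aux=1959)
for label, n in breakdown.items():
    print(f"{label}: {n:,}")
```

On the real data the same split can be computed directly from the `phaC_is_core` and `phaC_is_aux` columns, which avoids relying on the assumption.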


In [13]:
# Save species summary
out_path = os.path.join(DATA_DIR, 'phb_species_summary.tsv')
species_summary.to_csv(out_path, sep='\t', index=False)
print(f'Saved {len(species_summary):,} species to {out_path}')

Saved 19,496 species to /home/aparkin/BERIL-research-observatory/projects/phb_granule_ecology/data/phb_species_summary.tsv


## Part 3: NMDC Schema Exploration

Explore the `nmdc_arkin` database to understand what per-sample functional annotation data is available for the metagenomic arm of the analysis.

In [14]:
# List all tables in nmdc_arkin
nmdc_tables = spark.sql("SHOW TABLES IN nmdc_arkin").toPandas()
print(f'NMDC tables: {len(nmdc_tables)}')
nmdc_tables

NMDC tables: 63


| | namespace | tableName | isTemporary |
|---|-----------|-----------|-------------|
| 0 | nmdc_arkin | annotation_terms_unified | False |
| 1 | nmdc_arkin | cog_categories | False |
| 2 | nmdc_arkin | cog_hierarchy_flat | False |
| 3 | nmdc_arkin | ec_hierarchy_flat | False |
| 4 | nmdc_arkin | ec_hierarchy_graph | False |
| ... | ... | ... | ... |
| 58 | nmdc_arkin | trait_unified | False |
| 59 | nmdc_arkin | annotation_hierarchies_unified | False |
| 60 | nmdc_arkin | abiotic_features | False |
| 61 | nmdc_arkin | metabolomics_gold | False |


In [15]:
# Check trait_features columns — do any relate to PHB/PHA?
trait_schema = spark.sql("DESCRIBE nmdc_arkin.trait_features").toPandas()
print('trait_features columns:')
for _, row in trait_schema.iterrows():
    col = row['col_name']
    if any(kw in col.lower() for kw in ['pha', 'phb', 'poly', 'granule', 'storage', 'carbon', 'ferment']):
        print(f'  *** {col} ({row["data_type"]})')
    else:
        print(f'      {col} ({row["data_type"]})')

trait_features columns:
      cell_shape (double)
      oxygen_preference (double)
      functional_group:aerobic_chemoheterotrophy (double)
  *** functional_group:fermentation (double)
      functional_group:nitrate_denitrification (double)
      functional_group:nitrate_respiration (double)
      functional_group:nitrogen_fixation (double)
      functional_group:dark_thiosulfate_oxidation (double)
      functional_group:nitrate_reduction (double)
      functional_group:aromatic_compound_degradation (double)
  *** functional_group:aromatic_hydrocarbon_degradation (double)
      functional_group:arsenite_oxidation_energy_yielding (double)
      functional_group:cellulolysis (double)
      functional_group:dark_hydrogen_oxidation (double)
      functional_group:denitrification (double)
      functional_group:human_pathogens_all (double)
      functional_group:human_pathogens_pneumonia (double)
      functional_group:human_pathogens_septicemia (double)
      functional_group:invertebrate
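The substring test in this cell also flags `functional_group:aromatic_hydrocarbon_degradation`, because `carbon` occurs inside `hydrocarbon`. One way to avoid that kind of false positive is to split each column name into tokens and match keywords only at the start of a token. The sketch below is illustrative: `flag_column` is a hypothetical helper, and the keyword list is copied from the cell above.

```python
import re

KEYWORDS = ["pha", "phb", "poly", "granule", "storage", "carbon", "ferment"]

def flag_column(col_name, keywords=KEYWORDS):
    """True if any token of the column name starts with one of the keywords."""
    # Split on anything that is not a letter or digit (':' and '_' included)
    tokens = [t for t in re.split(r"[^a-z0-9]+", col_name.lower()) if t]
    return any(tok.startswith(kw) for tok in tokens for kw in keywords)

for col in [
    "functional_group:fermentation",
    "functional_group:aromatic_hydrocarbon_degradation",
    "functional_group:cellulolysis",
]:
    marker = "***" if flag_column(col) else "   "
    print(f"  {marker} {col}")
```

Prefix matching still keeps `fermentation` (it starts with `ferment`) while dropping `hydrocarbon`. Short keywords such as `pha` remain permissive, so flagged columns still need a manual look.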

In [16]:
# Check abiotic_features schema — what environmental measurements are available?
abiotic_schema = spark.sql("DESCRIBE nmdc_arkin.abiotic_features").toPandas()
print(f'abiotic_features columns ({len(abiotic_schema)}):')
for _, row in abiotic_schema.iterrows():
    print(f'  {row["col_name"]} ({row["data_type"]})')

abiotic_features columns (22):
  sample_id (string)
  annotations_ammonium_has_numeric_value (double)
  annotations_ammonium_nitrogen_has_numeric_value (double)
  annotations_calcium_has_numeric_value (double)
  annotations_carb_nitro_ratio_has_numeric_value (double)
  annotations_chlorophyll_has_numeric_value (double)
  annotations_conduc_has_numeric_value (double)
  annotations_depth_has_maximum_numeric_value (double)
  annotations_depth_has_minimum_numeric_value (double)
  annotations_depth_has_numeric_value (double)
  annotations_diss_org_carb_has_numeric_value (double)
  annotations_diss_oxygen_has_numeric_value (double)
  annotations_magnesium_has_numeric_value (double)
  annotations_manganese_has_numeric_value (double)
  annotations_ph (double)
  annotations_potassium_has_numeric_value (double)
  annotations_samp_size_has_numeric_value (double)
  annotations_soluble_react_phosp_has_numeric_value (double)
  annotations_temp_has_numeric_value (double)
  annotations_tot_nitro_conte

In [17]:
# Check study_table — what NMDC studies are available?
studies = spark.sql("SELECT * FROM nmdc_arkin.study_table").toPandas()
print(f'NMDC studies: {len(studies)}')
studies

NMDC studies: 48


Unnamed: 0,study_id,name,description,ecosystem,ecosystem_category,ecosystem_type,ecosystem_subtype,specific_ecosystem,principal_investigator_has_raw_value,principal_investigator_profile_image_url,...,part_of,principal_investigator_email,study_image,insdc_bioproject_identifiers,homepage_website,gnps_task_identifiers,jgi_portal_study_identifiers,notes,emsl_project_identifiers,alternative_names
0,nmdc:sty-11-8fb6t785,Deep subsurface shale carbon reservoir microbi...,This project aims to improve the understanding...,Environmental,Terrestrial,Deep subsurface,Unclassified,Unclassified,Kelly Wrighton,https://portal.nersc.gov/project/m3408/profile...,...,,,,,,,,,,
1,nmdc:sty-11-33fbta56,"Peatland microbial communities from Minnesota,...",This study is part of the Spruce and Peatland ...,Environmental,Aquatic,Freshwater,Wetlands,Unclassified,Christopher Schadt,https://portal.nersc.gov/project/m3408/profile...,...,"[""nmdc:sty-11-cytnjc39""]",,,,,,,,,
2,nmdc:sty-11-aygzgv51,Riverbed sediment microbial communities from t...,This research project aimed to understand how ...,Environmental,Aquatic,Freshwater,River,Sediment,James Stegen,https://portal.nersc.gov/project/m3408/profile...,...,"[""nmdc:sty-11-x4aawf73"", ""nmdc:sty-11-xcbexm97""]",,,,,,,,,
3,nmdc:sty-11-34xj1150,National Ecological Observatory Network: soil ...,This study contains the quality-controlled lab...,,,,,,Kate Thibault,https://portal.nersc.gov/project/m3408/profile...,...,"[""nmdc:sty-11-nxrz9m96""]",kthibault@battelleecology.org,"[{""url"": ""https://portal.nersc.gov/project/m34...","[""bioproject:PRJNA406974"", ""bioproject:PRJNA10...","[""https://www.neonscience.org/""]",,,,,
4,nmdc:sty-11-076c9980,Lab enrichment of tropical soil microbial comm...,This study is part of the Microbes Persist: Sy...,Environmental,Terrestrial,Soil,Unclassified,Forest Soil,Jennifer Pett-Ridge,https://portal.nersc.gov/project/m3408/profile...,...,"[""nmdc:sty-11-msexsy29""]",,,,,,,,,
5,nmdc:sty-11-t91cwb40,Determining the genomic basis for interactions...,The goal of this work is to develop the knowle...,,,,,,Michelle O'Malley,https://chemengr.ucsb.edu/sites/default/files/...,...,,momalley@engineering.ucsb.edu,,,,,,,,
6,nmdc:sty-11-5bgrvr62,Freshwater microbial communities from Lake Men...,The goal of this study is to examine long-term...,,,,,,Katherine McMahon,https://portal.nersc.gov/project/m3408/profile...,...,,tmcmahon@cae.wisc.edu,,,,,,,,
7,nmdc:sty-11-5tgfr349,Freshwater microbial communities from rivers f...,Streams and rivers represent key functioning u...,Environmental,Aquatic,Freshwater,River,Unclassified,Kelly Wrighton,https://portal.nersc.gov/project/m3408/profile...,...,"[""nmdc:sty-11-x4aawf73"", ""nmdc:sty-11-xcbexm97""]",kwrighton@gmail.com,,,,,,,,
8,nmdc:sty-11-dcqce727,Bulk soil microbial communities from the East ...,This research project aimed to understand how ...,Environmental,Terrestrial,Soil,Meadow,Bulk soil,Eoin Brodie,https://portal.nersc.gov/project/m3408/profile...,...,"[""nmdc:sty-11-2zhqs261""]",,,,,,,,,
9,nmdc:sty-11-1t150432,Populus root and rhizosphere microbial communi...,This study is part of the Plant-Microbe Interf...,Host-associated,Plants,Unclassified,Unclassified,Unclassified,Mitchel J. Doktycz,https://portal.nersc.gov/project/m3408/profile...,...,"[""nmdc:sty-11-f1he1955""]",,,,,,,,,


In [18]:
# Check taxonomy_features — sample count and structure
tax_count = spark.sql("SELECT COUNT(*) as n FROM nmdc_arkin.taxonomy_features").collect()[0]['n']
tax_schema = spark.sql("DESCRIBE nmdc_arkin.taxonomy_features").toPandas()
print(f'taxonomy_features: {tax_count:,} samples, {len(tax_schema)} columns')
print('\nFirst 10 columns:')
tax_schema.head(10)

taxonomy_features: 6,365 samples, 3493 columns

First 10 columns:


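| | col_name | data_type | comment |
|---|----------|-----------|---------|
| 0 | 7 | double | |
| 1 | 11 | double | |
| 2 | 33 | double | |
| 3 | 34 | double | |
| 4 | 35 | double | |
| 5 | 41 | double | |
| 6 | 43 | double | |
| 7 | 48 | double | |
| 8 | 52 | double | |
| 9 | 56 | double | |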


In [19]:
# Check if there are per-sample functional annotation tables we might have missed
# Look for tables with 'annotation', 'ko', 'kegg', 'function', 'gene' in the name
annotation_tables = nmdc_tables[nmdc_tables['tableName'].str.contains(
    'annot|ko|kegg|function|gene|contig', case=False, na=False
)]
print('Potential per-sample annotation tables:')
for _, row in annotation_tables.iterrows():
    tname = row['tableName']
    try:
        cnt = spark.sql(f"SELECT COUNT(*) as n FROM nmdc_arkin.{tname}").collect()[0]['n']
        print(f'  {tname}: {cnt:,} rows')
    except Exception as e:
        print(f'  {tname}: error - {e}')

Potential per-sample annotation tables:
  annotation_terms_unified: 67,353 rows
  kegg_ko_module: 2,814 rows
  kegg_ko_pathway: 15,617 rows
  kegg_ko_terms: 8,104 rows
  kegg_module_terms: 370 rows
  kegg_pathway_terms: 306 rows
  contig_taxonomy: 3,981,010,222 rows
  contig_taxonomy_backup: 9,009,525,315 rows
  annotation_crossrefs: 46,861 rows
  annotation_hierarchies_unified: 75,181 rows


In [None]:
# Check KEGG KO terms — verify our PHB KOs exist in NMDC
phb_ko_list = "', '".join(PHB_KOS.keys())
nmdc_kos = spark.sql(f"""
    SELECT * FROM nmdc_arkin.kegg_ko_terms 
    WHERE ko_id IN ('{phb_ko_list}')
""").toPandas()
print('PHB KEGG KOs in NMDC reference:')
nmdc_kos

In [None]:
# Check metabolomics_gold schema and search for 3-hydroxybutyrate
metab_schema = spark.sql("DESCRIBE nmdc_arkin.metabolomics_gold").toPandas()
print('metabolomics_gold columns:')
print(metab_schema[['col_name', 'data_type']].to_string(index=False))

# The compound name column is "Compound Name" (with space and capitals)
# Also available: "name", "Traditional Name", "Common Name", "IUPAC Name"
# Use backticks for columns with spaces in SQL
name_col = '`Compound Name`'
hb_metabolites = spark.sql(f"""
    SELECT DISTINCT {name_col}, `Common Name`, `Traditional Name`, name, kegg, smiles
    FROM nmdc_arkin.metabolomics_gold
    WHERE LOWER({name_col}) LIKE '%hydroxybutyrat%'
       OR LOWER({name_col}) LIKE '%phb%'
       OR LOWER(name) LIKE '%hydroxybutyrat%'
       OR LOWER(`Common Name`) LIKE '%hydroxybutyrat%'
    LIMIT 20
""").toPandas()
print(f'\n3-hydroxybutyrate-related metabolites in NMDC:')
print(hb_metabolites)

In [None]:
# Check ncbi_env harmonized_name categories (for pangenome environment data)
env_categories = spark.sql("""
    SELECT harmonized_name, COUNT(*) as n
    FROM kbase_ke_pangenome.ncbi_env
    GROUP BY harmonized_name
    ORDER BY n DESC
""").toPandas()
print('NCBI environment metadata categories:')
env_categories

## Summary and Next Steps

### Pangenome Results
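- Total PHB gene clusters found: 118,513 (across 19,496 of 27,702 species)
- Species with phaC (committed step): 6,067
- Species with complete pathway (phaC + phaA/B): 6,005
- phaC core vs accessory breakdown: 5,371 species have at least one core phaC cluster and 1,959 have at least one accessory phaC cluster (the two groups overlap)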

### NMDC Data Availability
- Per-sample functional annotation tables: ?
- Trait features relevant to PHB: ?
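- Abiotic features available: 22 columns (`sample_id` plus measurements such as pH, temperature, depth, dissolved oxygen, and dissolved organic carbon)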
- 3-hydroxybutyrate in metabolomics: ?

### Next Notebook (NB02)
Map PHB pathway completeness across the GTDB tree using taxonomy data.