# NB01: NMDC Schema Exploration and Sample Inventory

**Project**: Community Metabolic Ecology via NMDC × Pangenome Integration  
**Requires**: BERDL JupyterHub (Spark — `get_spark_session()` injected into kernel)  

## Purpose

Gate notebook for the project. Before writing NB02–NB05, we must verify:

1. Column names in `nmdc_arkin.study_table`, `taxonomy_features`, and `metabolomics_gold`  
   (not fully documented in schema docs)
2. Whether `sample_id` values are prefixed with `study_id` (Query 1 depends on this)
3. Whether `taxonomy_features` provides relative abundances or raw read counts
4. Whether `metabolomics_gold` carries KEGG/ChEBI compound IDs in its own columns
5. Which taxonomy classifier (kraken_gold, centrifuge_gold, gottcha_gold, taxonomy_features)  
   provides the most species-level resolution
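
Check 3 (relative abundances vs. raw read counts) can be approached with a simple heuristic on one sample's values. The sketch below is only an illustration: `classify_abundance_scale` is a hypothetical helper, and the tolerance and category labels are assumptions rather than NMDC conventions.

```python
import pandas as pd


def classify_abundance_scale(values: pd.Series, tol: float = 0.05) -> str:
    """Guess whether one sample's abundance vector is a fraction, a percent, or counts.

    Hypothetical helper: thresholds and labels are illustrative assumptions.
    """
    values = values.dropna()
    if values.empty:
        return "empty"
    if (values < 0).any():
        # CLR-style transforms can be negative; counts and proportions cannot.
        return "transformed"
    total = values.sum()
    if abs(total - 1.0) <= tol:
        return "fraction"
    if abs(total - 100.0) <= 100.0 * tol:
        return "percent"
    if (values == values.round()).all():
        return "counts"
    return "unknown"


print(classify_abundance_scale(pd.Series([0.5, 0.3, 0.2])))
print(classify_abundance_scale(pd.Series([120.0, 30.0, 7.0])))
```

The check is only meaningful within a single taxonomic rank for a single sample. Values pooled across ranks (for example clade-level totals) would sum to more than the whole and land in the "unknown" bucket.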

## Outputs

- `data/nmdc_sample_inventory.csv` — samples with paired taxonomy + metabolomics
- `data/nmdc_taxonomy_coverage.csv` — per-sample taxonomy stats per classifier
- `data/nmdc_metabolomics_coverage.csv` — per-sample compound counts and annotation rates
- `figures/nmdc_sample_coverage.png` — ecosystem type distribution and sample overlap

## Key Decisions to Make

- Which taxonomy classifier to use in NB02?
- Is normalization needed before computing community-weighted completeness (NB03)?
- How to map metabolomics compounds to amino acid identity (Plan A: KEGG IDs; Plan B: name matching)?
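
The Plan A / Plan B fallback for amino acid identity could look roughly like the following. This is a minimal sketch, not the project's implementation: `amino_acid_identity` and both lookup tables are hypothetical, and the tables are deliberately partial. Every KEGG ID should be verified against KEGG COMPOUND before use.

```python
from typing import Optional

# Illustrative subset only; verify each ID against KEGG COMPOUND before relying on it.
KEGG_TO_AA = {
    "C00041": "alanine",
    "C00037": "glycine",
    "C00065": "serine",
    "C00025": "glutamate",
}
# Synonyms that collapse to one amino acid identity (also a partial, assumed list).
NAME_TO_AA = {
    "alanine": "alanine",
    "glycine": "glycine",
    "serine": "serine",
    "glutamate": "glutamate",
    "glutamic acid": "glutamate",
}


def amino_acid_identity(kegg: Optional[str], name: Optional[str]) -> Optional[str]:
    """Plan A: look up the KEGG ID. Plan B: exact match on a normalized name."""
    if kegg and kegg.strip() in KEGG_TO_AA:
        return KEGG_TO_AA[kegg.strip()]
    if name:
        normalized = name.strip().lower()
        # Drop a leading stereo prefix such as "L-" or "DL-" before matching.
        for prefix in ("dl-", "l-", "d-"):
            if normalized.startswith(prefix):
                normalized = normalized[len(prefix):]
                break
        return NAME_TO_AA.get(normalized)
    return None


print(amino_acid_identity("C00041", None))
print(amino_acid_identity(None, "L-Glutamic acid"))
print(amino_acid_identity(None, "N-Acetylglycine"))
```

Exact matching on a normalized name is stricter than a substring search, which would also pick up derivatives such as N-acetylglycine. Whether derivatives should count is a project decision this sketch does not settle.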

In [1]:
# On BERDL JupyterHub — get_spark_session() is injected into the kernel; no import needed
spark = get_spark_session()
spark

<pyspark.sql.connect.session.SparkSession at 0x7058448c56a0>

In [2]:
import os
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
# import matplotlib_venn  # pip install matplotlib-venn if needed

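# Paths: the notebook is assumed to run from a subdirectory of the project root
# (adjust if running from a different cwd). __file__ is undefined in notebooks,
# so the root is resolved from the current working directory.
PROJECT_DIR = os.path.abspath(os.path.join(os.getcwd(), '..'))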
DATA_DIR = os.path.join(PROJECT_DIR, 'data')
FIGURES_DIR = os.path.join(PROJECT_DIR, 'figures')
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(FIGURES_DIR, exist_ok=True)
print(f'DATA_DIR: {DATA_DIR}')
print(f'FIGURES_DIR: {FIGURES_DIR}')

DATA_DIR: /home/cjneely/repos/BERIL-research-observatory/projects/nmdc_community_metabolic_ecology/data
FIGURES_DIR: /home/cjneely/repos/BERIL-research-observatory/projects/nmdc_community_metabolic_ecology/figures


---
## Part 1: Schema Verification

Verify column names before building any dependent queries.
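
One way to make this verification mechanical is to compare the columns a later query expects against the `DESCRIBE` output. The sketch below assumes the schema has been pulled into pandas (for example via `.toPandas()`) and uses a toy schema in its place. `missing_columns` is a hypothetical helper name.

```python
import pandas as pd


def missing_columns(schema_df: pd.DataFrame, expected: list[str]) -> list[str]:
    """Return the expected column names absent from a DESCRIBE-style schema frame."""
    present = set(schema_df["col_name"])
    return [col for col in expected if col not in present]


# Toy schema standing in for spark.sql('DESCRIBE <table>').toPandas().
toy_schema = pd.DataFrame(
    {
        "col_name": ["file_id", "file_name", "rank", "abundance"],
        "data_type": ["string", "string", "string", "float"],
    }
)
print(missing_columns(toy_schema, ["file_id", "sample_id", "abundance"]))
# → ['sample_id']
```

An empty result means every expected column is present. A non-empty one names exactly what a dependent query would fail on.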

In [3]:
# Step 1a: study_table — confirm study_id, ecosystem_category, ecosystem_type
print('=== nmdc_arkin.study_table schema ===')
spark.sql('DESCRIBE nmdc_arkin.study_table').show(50, truncate=False)

=== nmdc_arkin.study_table schema ===
+----------------------------------------+---------+-------+
|col_name                                |data_type|comment|
+----------------------------------------+---------+-------+
|study_id                                |string   |NULL   |
|name                                    |string   |NULL   |
|description                             |string   |NULL   |
|ecosystem                               |string   |NULL   |
|ecosystem_category                      |string   |NULL   |
|ecosystem_type                          |string   |NULL   |
|ecosystem_subtype                       |string   |NULL   |
|specific_ecosystem                      |string   |NULL   |
|principal_investigator_has_raw_value    |string   |NULL   |
|principal_investigator_profile_image_url|string   |NULL   |
|principal_investigator_orcid            |string   |NULL   |
|principal_investigator_type             |string   |NULL   |
|type                                    |strin

In [4]:
# Step 1b: taxonomy_features — confirm sample_id column, taxon/abundance columns
print('=== nmdc_arkin.taxonomy_features schema ===')
spark.sql('DESCRIBE nmdc_arkin.taxonomy_features').show(50, truncate=False)

=== nmdc_arkin.taxonomy_features schema ===
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|7       |double   |NULL   |
|11      |double   |NULL   |
|33      |double   |NULL   |
|34      |double   |NULL   |
|35      |double   |NULL   |
|41      |double   |NULL   |
|43      |double   |NULL   |
|48      |double   |NULL   |
|52      |double   |NULL   |
|56      |double   |NULL   |
|69      |double   |NULL   |
|114     |double   |NULL   |
|119     |double   |NULL   |
|125     |double   |NULL   |
|128     |double   |NULL   |
|192     |double   |NULL   |
|193     |double   |NULL   |
|266     |double   |NULL   |
|280     |double   |NULL   |
|287     |double   |NULL   |
|292     |double   |NULL   |
|293     |double   |NULL   |
|294     |double   |NULL   |
|300     |double   |NULL   |
|303     |double   |NULL   |
|305     |double   |NULL   |
|316     |double   |NULL   |
|317     |double   |NULL   |
|329     |double   |NULL   |
|337     |double   |NULL   |

In [5]:
# Step 1c: metabolomics_gold — look for KEGG/ChEBI compound ID columns
print('=== nmdc_arkin.metabolomics_gold schema ===')
spark.sql('DESCRIBE nmdc_arkin.metabolomics_gold').show(50, truncate=False)

=== nmdc_arkin.metabolomics_gold schema ===
+--------------------------------------------+---------+-------+
|col_name                                    |data_type|comment|
+--------------------------------------------+---------+-------+
|file_id                                     |string   |NULL   |
|file_name                                   |string   |NULL   |
|feature_id                                  |string   |NULL   |
|Apex Scan Number                            |double   |NULL   |
|Area                                        |double   |NULL   |
|Associated Mass Features after Deconvolution|string   |NULL   |
|Calculated m/z                              |double   |NULL   |
|Confidence Score                            |double   |NULL   |
|Dispersity Index                            |double   |NULL   |
|Entropy Similarity                          |double   |NULL   |
|Intensity                                   |double   |NULL   |
|Ion Formula                                 |

In [None]:
# Step 1d: Inspect raw values to understand ID formats and join strategy
print('=== Sample study_table rows ===')
spark.sql("""
    SELECT study_id, ecosystem_category, ecosystem_type, ecosystem_subtype
    FROM nmdc_arkin.study_table
    LIMIT 5
""").show(truncate=False)

# taxonomy_features has all-numeric column names — it is a WIDE-FORMAT matrix.
# Count rows to determine orientation (rows = samples, or rows = taxa?)
print('\n=== taxonomy_features row count and first 3 rows ===')
n_tax_rows = spark.sql("SELECT COUNT(*) as n FROM nmdc_arkin.taxonomy_features").collect()[0]['n']
print(f'Row count: {n_tax_rows}')
# Show first few rows (selecting individual numeric columns would need backtick-quoting, e.g. `7`)
spark.sql("SELECT * FROM nmdc_arkin.taxonomy_features LIMIT 3").show(5, truncate=True)

# metabolomics_gold uses file_id, not sample_id
print('\n=== Sample metabolomics_gold rows (file_id, name, kegg, chebi) ===')
spark.sql("""
    SELECT file_id, file_name, `name`, kegg, chebi, `Molecular Formula`
    FROM nmdc_arkin.metabolomics_gold
    LIMIT 5
""").show(truncate=False)

In [7]:
# Step 1e: Check taxonomy classifiers for available rank columns
for tbl in ['kraken_gold', 'centrifuge_gold', 'gottcha_gold']:
    try:
        print(f'\n=== nmdc_arkin.{tbl} schema ===')
        spark.sql(f'DESCRIBE nmdc_arkin.{tbl}').show(30, truncate=False)
    except Exception as e:
        print(f'  ERROR describing {tbl}: {e}')


=== nmdc_arkin.kraken_gold schema ===
+-------------+---------+-------+
|col_name     |data_type|comment|
+-------------+---------+-------+
|percent      |float    |NULL   |
|clade_reads  |int      |NULL   |
|direct_reads |int      |NULL   |
|taxid        |int      |NULL   |
|name         |string   |NULL   |
|file_id      |string   |NULL   |
|file_name    |string   |NULL   |
|rank         |string   |NULL   |
|taxid_lineage|string   |NULL   |
|lineage      |string   |NULL   |
|abundance    |float    |NULL   |
|abundance_clr|float    |NULL   |
|rn           |bigint   |NULL   |
+-------------+---------+-------+


=== nmdc_arkin.centrifuge_gold schema ===
+-------------+---------+-------+
|col_name     |data_type|comment|
+-------------+---------+-------+
|file_id      |string   |NULL   |
|file_name    |string   |NULL   |
|rank         |string   |NULL   |
|taxid        |bigint   |NULL   |
|taxid_lineage|string   |NULL   |
|lineage      |string   |NULL   |
|label        |string   |NULL   |

---
## Part 2: Study and Sample Inventory

How many studies exist? What ecosystems are covered? How many samples have paired taxonomy + metabolomics?
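
The pairing question reduces to a three-way split of identifiers: taxonomy only, metabolomics only, or both. A minimal pandas sketch of that split follows. It assumes the two ID lists have already been collected from Spark, and the `f1`…`f4` IDs are invented placeholders, not NMDC identifiers.

```python
import pandas as pd


def availability_counts(tax_ids, met_ids) -> dict:
    """Count IDs found only in taxonomy, only in metabolomics, or in both."""
    tax = pd.DataFrame({"file_id": sorted(set(tax_ids))})
    met = pd.DataFrame({"file_id": sorted(set(met_ids))})
    # indicator=True adds a _merge column: left_only / right_only / both.
    merged = tax.merge(met, on="file_id", how="outer", indicator=True)
    counts = merged["_merge"].value_counts()
    return {
        "taxonomy_only": int(counts.get("left_only", 0)),
        "metabolomics_only": int(counts.get("right_only", 0)),
        "both": int(counts.get("both", 0)),
    }


print(availability_counts(["f1", "f2", "f3"], ["f3", "f4"]))
# → {'taxonomy_only': 2, 'metabolomics_only': 1, 'both': 1}
```

Deduplicating before the merge matters here because tidy classifier tables carry many rows per file.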

In [8]:
# All studies with row count
study_df = spark.sql("""
    SELECT *
    FROM nmdc_arkin.study_table
""").toPandas()

print(f'Total studies: {len(study_df)}')
print(study_df.columns.tolist())
study_df.head(10)

Total studies: 48
['study_id', 'name', 'description', 'ecosystem', 'ecosystem_category', 'ecosystem_type', 'ecosystem_subtype', 'specific_ecosystem', 'principal_investigator_has_raw_value', 'principal_investigator_profile_image_url', 'principal_investigator_orcid', 'principal_investigator_type', 'type', 'funding_sources', 'has_credit_associations', 'gold_study_identifiers', 'title', 'study_category', 'associated_dois', 'protocol_link', 'principal_investigator_name', 'websites', 'part_of', 'principal_investigator_email', 'study_image', 'insdc_bioproject_identifiers', 'homepage_website', 'gnps_task_identifiers', 'jgi_portal_study_identifiers', 'notes', 'emsl_project_identifiers', 'alternative_names']


Unnamed: 0,study_id,name,description,ecosystem,ecosystem_category,ecosystem_type,ecosystem_subtype,specific_ecosystem,principal_investigator_has_raw_value,principal_investigator_profile_image_url,...,part_of,principal_investigator_email,study_image,insdc_bioproject_identifiers,homepage_website,gnps_task_identifiers,jgi_portal_study_identifiers,notes,emsl_project_identifiers,alternative_names
0,nmdc:sty-11-8fb6t785,Deep subsurface shale carbon reservoir microbi...,This project aims to improve the understanding...,Environmental,Terrestrial,Deep subsurface,Unclassified,Unclassified,Kelly Wrighton,https://portal.nersc.gov/project/m3408/profile...,...,,,,,,,,,,
1,nmdc:sty-11-33fbta56,"Peatland microbial communities from Minnesota,...",This study is part of the Spruce and Peatland ...,Environmental,Aquatic,Freshwater,Wetlands,Unclassified,Christopher Schadt,https://portal.nersc.gov/project/m3408/profile...,...,"[""nmdc:sty-11-cytnjc39""]",,,,,,,,,
2,nmdc:sty-11-aygzgv51,Riverbed sediment microbial communities from t...,This research project aimed to understand how ...,Environmental,Aquatic,Freshwater,River,Sediment,James Stegen,https://portal.nersc.gov/project/m3408/profile...,...,"[""nmdc:sty-11-x4aawf73"", ""nmdc:sty-11-xcbexm97""]",,,,,,,,,
3,nmdc:sty-11-34xj1150,National Ecological Observatory Network: soil ...,This study contains the quality-controlled lab...,,,,,,Kate Thibault,https://portal.nersc.gov/project/m3408/profile...,...,"[""nmdc:sty-11-nxrz9m96""]",kthibault@battelleecology.org,"[{""url"": ""https://portal.nersc.gov/project/m34...","[""bioproject:PRJNA406974"", ""bioproject:PRJNA10...","[""https://www.neonscience.org/""]",,,,,
4,nmdc:sty-11-076c9980,Lab enrichment of tropical soil microbial comm...,This study is part of the Microbes Persist: Sy...,Environmental,Terrestrial,Soil,Unclassified,Forest Soil,Jennifer Pett-Ridge,https://portal.nersc.gov/project/m3408/profile...,...,"[""nmdc:sty-11-msexsy29""]",,,,,,,,,
5,nmdc:sty-11-t91cwb40,Determining the genomic basis for interactions...,The goal of this work is to develop the knowle...,,,,,,Michelle O'Malley,https://chemengr.ucsb.edu/sites/default/files/...,...,,momalley@engineering.ucsb.edu,,,,,,,,
6,nmdc:sty-11-5bgrvr62,Freshwater microbial communities from Lake Men...,The goal of this study is to examine long-term...,,,,,,Katherine McMahon,https://portal.nersc.gov/project/m3408/profile...,...,,tmcmahon@cae.wisc.edu,,,,,,,,
7,nmdc:sty-11-5tgfr349,Freshwater microbial communities from rivers f...,Streams and rivers represent key functioning u...,Environmental,Aquatic,Freshwater,River,Unclassified,Kelly Wrighton,https://portal.nersc.gov/project/m3408/profile...,...,"[""nmdc:sty-11-x4aawf73"", ""nmdc:sty-11-xcbexm97""]",kwrighton@gmail.com,,,,,,,,
8,nmdc:sty-11-dcqce727,Bulk soil microbial communities from the East ...,This research project aimed to understand how ...,Environmental,Terrestrial,Soil,Meadow,Bulk soil,Eoin Brodie,https://portal.nersc.gov/project/m3408/profile...,...,"[""nmdc:sty-11-2zhqs261""]",,,,,,,,,
9,nmdc:sty-11-1t150432,Populus root and rhizosphere microbial communi...,This study is part of the Plant-Microbe Interf...,Host-associated,Plants,Unclassified,Unclassified,Unclassified,Mitchel J. Doktycz,https://portal.nersc.gov/project/m3408/profile...,...,"[""nmdc:sty-11-f1he1955""]",,,,,,,,,


In [None]:
# Count samples with taxonomy data
# taxonomy_features is wide-format with no sample_id column; rows are presumed to be
# samples (6,365 expected), so count rows directly
n_tax_samples = spark.sql("""
    SELECT COUNT(*) as n FROM nmdc_arkin.taxonomy_features
""").collect()[0]['n']

# Classifier tables (tidy format) use file_id, not sample_id
n_kraken_files = spark.sql("""
    SELECT COUNT(DISTINCT file_id) as n FROM nmdc_arkin.kraken_gold
""").collect()[0]['n']

# metabolomics_gold uses file_id, not sample_id
n_met_files = spark.sql("""
    SELECT COUNT(DISTINCT file_id) as n FROM nmdc_arkin.metabolomics_gold
""").collect()[0]['n']

print(f'taxonomy_features row count (likely = n_samples): {n_tax_samples}')
print(f'kraken_gold distinct file_ids:                    {n_kraken_files}')
print(f'metabolomics_gold distinct file_ids:              {n_met_files}')

In [None]:
# Files with BOTH taxonomy (kraken_gold) AND metabolomics data
# Both tables use file_id as identifier — join on that
# taxonomy_features is wide-format and cannot be joined this way; use classifier tables instead
overlap_df = spark.sql("""
    SELECT k.file_id, k.file_name
    FROM (SELECT DISTINCT file_id, file_name FROM nmdc_arkin.kraken_gold) k
    JOIN (SELECT DISTINCT file_id FROM nmdc_arkin.metabolomics_gold) m
        ON k.file_id = m.file_id
""").toPandas()

print(f'Files with BOTH kraken taxonomy AND metabolomics: {len(overlap_df)}')
overlap_files = set(overlap_df['file_id'])

# Also check centrifuge and gottcha overlap with metabolomics
for clf in ['centrifuge_gold', 'gottcha_gold']:
    try:
        n = spark.sql(f"""
            SELECT COUNT(DISTINCT c.file_id)
            FROM (SELECT DISTINCT file_id FROM nmdc_arkin.{clf}) c
            JOIN (SELECT DISTINCT file_id FROM nmdc_arkin.metabolomics_gold) m
                ON c.file_id = m.file_id
        """).collect()[0][0]
        print(f'Files with {clf} AND metabolomics: {n}')
    except Exception as e:
        print(f'  {clf}: error — {e}')

In [None]:
# Inspect file_id and file_name formats to understand study → file linkage
print('Overlap file_id examples (first 10):')
print(overlap_df['file_id'].head(10).tolist())
print('\nOverlap file_name examples (first 10):')
print(overlap_df['file_name'].head(10).tolist())
print('\nStudy_id examples (from study_table):')
print(study_df['study_id'].head(5).tolist())

# Check if file_name contains study_id as a prefix or substring
# NMDC study IDs look like: nmdc:sty-11-XXXXXXXX
# File IDs may look like:   nmdc:dobj-XXXXXXXX
# Try to find the connection between them
print('\n=== Check if file_name hints at study association ===')
spark.sql("""
    SELECT file_id, file_name, COUNT(*) as n_rows
    FROM nmdc_arkin.kraken_gold
    GROUP BY file_id, file_name
    ORDER BY n_rows DESC
    LIMIT 10
""").show(truncate=False)

In [None]:
# Abiotic features — verify the ID column name first (may be sample_id OR file_id)
print('=== nmdc_arkin.abiotic_features schema ===')
abiotic_schema = spark.sql('DESCRIBE nmdc_arkin.abiotic_features').toPandas()
id_cols = abiotic_schema[abiotic_schema['col_name'].isin(['sample_id', 'file_id'])]
print(id_cols[['col_name', 'data_type']].to_string())
print()

# Determine which ID column to use
if 'file_id' in abiotic_schema['col_name'].values:
    abiotic_id_col = 'file_id'
elif 'sample_id' in abiotic_schema['col_name'].values:
    abiotic_id_col = 'sample_id'
else:
    abiotic_id_col = None
    print('WARNING: Neither file_id nor sample_id found in abiotic_features!')
    print('Available columns:', abiotic_schema['col_name'].tolist()[:10])

print(f'Using abiotic_features ID column: {abiotic_id_col}')

abiotic_df = spark.sql('SELECT * FROM nmdc_arkin.abiotic_features').toPandas()
print(f'\nTotal abiotic_features rows: {len(abiotic_df)}')

if abiotic_id_col:
    abiotic_overlap = abiotic_df[abiotic_df[abiotic_id_col].isin(overlap_files)]
    print(f'Abiotic features for overlap files: {len(abiotic_overlap)}')
else:
    abiotic_overlap = pd.DataFrame()
    print('Cannot filter abiotic_features — unknown ID column.')

print('Abiotic columns:', abiotic_df.columns.tolist())

---
## Part 3: Taxonomy Classifier Comparison

Which classifier (kraken_gold, centrifuge_gold, gottcha_gold, taxonomy_features) provides the most species-level resolution in overlap samples?
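
As a small illustration of the metric, the fraction of rows at species rank can be computed per file on a tidy table. The sketch below runs on invented toy data; `species_resolution` is a hypothetical helper, and only the `file_id` and `rank` column names are taken from the classifier schemas.

```python
import pandas as pd


def species_resolution(df: pd.DataFrame) -> pd.DataFrame:
    """Per-file row counts and the fraction of rows assigned at species rank."""
    # Case-insensitive comparison; missing ranks compare as False.
    is_species = df["rank"].str.lower().eq("species")
    summary = (
        df.assign(is_species=is_species)
        .groupby("file_id")
        .agg(n_rows=("is_species", "size"), n_species_rows=("is_species", "sum"))
        .reset_index()
    )
    summary["species_row_frac"] = summary["n_species_rows"] / summary["n_rows"]
    return summary


toy = pd.DataFrame(
    {
        "file_id": ["f1", "f1", "f1", "f1", "f2", "f2"],
        "rank": ["species", "Species", "genus", "family", "genus", "species"],
    }
)
print(species_resolution(toy))
```

A row fraction is not an abundance fraction: a classifier can emit many low-abundance species rows while most reads stay assigned at higher ranks. It may be worth weighting by `abundance` before choosing a classifier.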

In [None]:
# Classifier schema summary (from Part 1):
# kraken_gold:    file_id, file_name, rank, name (taxon name), taxid, abundance, abundance_clr
# centrifuge_gold: file_id, file_name, rank, label (taxon name), taxid, abundance, abundance_clr
# gottcha_gold:   file_id, file_name, rank, label (taxon name), taxid, abundance, abundance_clr

# For each classifier: what fraction of rows are at species rank?
# and how many overlap files have data?

classifiers = {
    'kraken_gold':     {'name_col': 'name',  'abund_col': 'abundance'},
    'centrifuge_gold': {'name_col': 'label', 'abund_col': 'abundance'},
    'gottcha_gold':    {'name_col': 'label', 'abund_col': 'abundance'},
}

clf_stats = []
for tbl, cols in classifiers.items():
    try:
        stats = spark.sql(f"""
            SELECT
                COUNT(*) as n_total_rows,
                COUNT(DISTINCT file_id) as n_files,
                SUM(CASE WHEN LOWER(rank) = 'species' THEN 1 ELSE 0 END) as n_species_rows,
                COUNT(DISTINCT CASE WHEN LOWER(rank) = 'species' THEN file_id END) as n_files_with_species
            FROM nmdc_arkin.{tbl}
        """).toPandas()
        stats['classifier'] = tbl
        stats['species_row_frac'] = stats['n_species_rows'] / stats['n_total_rows']

        # overlap with metabolomics
        n_overlap = spark.sql(f"""
            SELECT COUNT(DISTINCT c.file_id)
            FROM (SELECT DISTINCT file_id FROM nmdc_arkin.{tbl} WHERE LOWER(rank) = 'species') c
            JOIN (SELECT DISTINCT file_id FROM nmdc_arkin.metabolomics_gold) m
                ON c.file_id = m.file_id
        """).collect()[0][0]
        stats['n_overlap_with_metabolomics'] = n_overlap

        clf_stats.append(stats)
        print(f'{tbl}: {stats["n_files"].iloc[0]} files, '
              f'{stats["species_row_frac"].iloc[0]:.1%} rows at species rank, '
              f'{n_overlap} overlap with metabolomics')
    except Exception as e:
        print(f'  ERROR for {tbl}: {e}')

if clf_stats:
    clf_summary = pd.concat(clf_stats, ignore_index=True)
    display(clf_summary[['classifier', 'n_files', 'n_species_rows', 'species_row_frac',
                          'n_files_with_species', 'n_overlap_with_metabolomics']])

In [None]:
# Column name reference for NB02 (confirmed from schema verification above)
#
# kraken_gold:
#   RANK_COL   = 'rank'       # values: 'species', 'genus', 'family', ...
#   TAXON_COL  = 'name'       # taxon name string (e.g., 'Escherichia coli')
#   ABUND_COL  = 'abundance'  # relative abundance (float); likely pre-normalized, verify
#   TAXID_COL  = 'taxid'      # NCBI taxid (int)
#   ID_COL     = 'file_id'    # NOT sample_id
#
# centrifuge_gold / gottcha_gold:
#   RANK_COL   = 'rank'
#   TAXON_COL  = 'label'      # NOTE: 'label', not 'name'
#   ABUND_COL  = 'abundance'
#   ID_COL     = 'file_id'
#
# metabolomics_gold:
#   ID_COL         = 'file_id'
#   COMPOUND_NAME  = 'name'
#   KEGG_COL       = 'kegg'   # string (KEGG compound ID, e.g., 'C00041')
#   CHEBI_COL      = 'chebi'  # double (ChEBI ID)
#   INCHI_COL      = 'inchi'
#   ABUND_COL      = 'Area'   # or 'Intensity' — TBD after inspecting values

print('Column reference confirmed. No TODOs remaining for schema.')

---
## Part 4: Metabolomics Coverage

What fraction of measured compounds carry KEGG/ChEBI compound IDs (needed for amino acid matching in NB04)?
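
"Annotation rate" can be computed two ways: pooled over all feature rows, or as the mean of per-file rates. The two can differ substantially when files have very different feature counts. The toy sketch below shows the difference. `annotation_rates` is a hypothetical helper and the data are invented.

```python
import pandas as pd


def annotation_rates(df: pd.DataFrame, id_col: str = "kegg") -> dict:
    """Compare the pooled annotation rate with the mean of per-file rates."""
    # A row counts as annotated when the ID is neither missing nor blank.
    annotated = df[id_col].fillna("").astype(str).str.strip().ne("")
    pooled = annotated.mean()
    per_file = annotated.groupby(df["file_id"]).mean()
    return {"pooled": float(pooled), "mean_per_file": float(per_file.mean())}


toy = pd.DataFrame(
    {
        "file_id": ["f1"] * 8 + ["f2"] * 2,
        "kegg": ["C00041"] + [None] * 7 + ["C00037", "C00065"],
    }
)
print(annotation_rates(toy))
```

Here the large file `f1` is mostly unannotated, so the pooled rate (3 of 10 rows) is well below the per-file mean (the average of 1/8 and 2/2). Reporting both avoids confusing the two.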

In [None]:
# Per-file compound counts and intensity stats
# metabolomics_gold uses file_id (not sample_id)
met_file_stats = spark.sql("""
    SELECT
        file_id,
        COUNT(*) as n_features_total,
        COUNT(DISTINCT `name`) as n_named_compounds,
        SUM(CASE WHEN kegg IS NOT NULL AND kegg != '' THEN 1 ELSE 0 END) as n_kegg_annotated,
        SUM(CASE WHEN chebi IS NOT NULL THEN 1 ELSE 0 END) as n_chebi_annotated
    FROM nmdc_arkin.metabolomics_gold
    GROUP BY file_id
""").toPandas()

print(f'Files with metabolomics: {len(met_file_stats)}')
print(f'Features per file — median: {met_file_stats["n_features_total"].median():.0f}, '
      f'max: {met_file_stats["n_features_total"].max()}')
print(f'KEGG annotation rate (mean across files): '
      f'{(met_file_stats["n_kegg_annotated"] / met_file_stats["n_features_total"]).mean():.1%}')
print(f'ChEBI annotation rate (mean across files): '
      f'{(met_file_stats["n_chebi_annotated"] / met_file_stats["n_features_total"]).mean():.1%}')
met_file_stats.describe()

In [None]:
# Confirmed from DESCRIBE: metabolomics_gold has kegg (string), chebi (double), name (string)
# Compute overall annotation rate across all rows
annotation_summary = spark.sql("""
    SELECT
        COUNT(*) as n_total,
        SUM(CASE WHEN kegg IS NOT NULL AND kegg != '' THEN 1 ELSE 0 END) as n_kegg,
        SUM(CASE WHEN chebi IS NOT NULL THEN 1 ELSE 0 END) as n_chebi,
        SUM(CASE WHEN `name` IS NOT NULL AND `name` != '' THEN 1 ELSE 0 END) as n_named,
        COUNT(DISTINCT `name`) as n_unique_names,
        COUNT(DISTINCT kegg) as n_unique_kegg,
        COUNT(DISTINCT CAST(chebi AS BIGINT)) as n_unique_chebi
    FROM nmdc_arkin.metabolomics_gold
    WHERE kegg IS NOT NULL OR chebi IS NOT NULL OR `name` IS NOT NULL
""").toPandas()

print('Metabolomics annotation summary (over annotated rows):')
for col in annotation_summary.columns:
    print(f'  {col}: {annotation_summary[col].iloc[0]}')

# Compute rates against total
total = spark.sql("SELECT COUNT(*) as n FROM nmdc_arkin.metabolomics_gold").collect()[0]['n']
print(f'\nTotal rows: {total}')
print(f'KEGG annotation rate: {annotation_summary["n_kegg"].iloc[0] / total:.1%}')
print(f'ChEBI annotation rate: {annotation_summary["n_chebi"].iloc[0] / total:.1%}')

In [None]:
# Inspect sample rows to understand metabolomics values and units
spark.sql("""
    SELECT file_id, `name`, kegg, chebi, Area, Intensity, `Molecular Formula`, `Ion Formula`
    FROM nmdc_arkin.metabolomics_gold
    WHERE kegg IS NOT NULL AND kegg != ''
    LIMIT 10
""").show(truncate=False)

In [None]:
# Search for amino acid compounds using confirmed 'name' column
AMINO_ACIDS = [
    'alanine', 'arginine', 'asparagine', 'aspartate', 'aspartic acid',
    'cysteine', 'glutamate', 'glutamic acid', 'glutamine', 'glycine',
    'histidine', 'isoleucine', 'leucine', 'lysine', 'methionine',
    'phenylalanine', 'proline', 'serine', 'threonine', 'tryptophan',
    'tyrosine', 'valine'
]
aa_pattern = '|'.join(AMINO_ACIDS)

aa_hits = spark.sql(f"""
    SELECT `name`, kegg, COUNT(DISTINCT file_id) as n_files,
           AVG(Area) as mean_area
    FROM nmdc_arkin.metabolomics_gold
    WHERE LOWER(`name`) RLIKE '{aa_pattern}'
    GROUP BY `name`, kegg
    ORDER BY n_files DESC
""").toPandas()

print(f'Amino acid compound hits: {len(aa_hits)}')
print(f'Max files covering any single AA compound: {aa_hits["n_files"].max()}')
print()
print(aa_hits.to_string())

---
## Part 5: Save Outputs and Figures

In [None]:
# Save file inventory (files with both classifier taxonomy AND metabolomics)
inventory = overlap_df.copy()  # columns: file_id, file_name
inventory['has_taxonomy'] = True
inventory['has_metabolomics'] = True

# Merge abiotic features for overlap files (if ID column was found)
if len(abiotic_overlap) > 0:
    merge_col = abiotic_id_col  # 'file_id' or 'sample_id'
    if merge_col in inventory.columns and merge_col in abiotic_overlap.columns:
        inventory = inventory.merge(abiotic_overlap, on=merge_col, how='left')
    else:
        print(f'WARNING: Cannot merge abiotic data — ID column mismatch '
              f'(inventory has {inventory.columns.tolist()[:4]}, '
              f'abiotic has {abiotic_overlap.columns.tolist()[:4]})')

inventory.to_csv(os.path.join(DATA_DIR, 'nmdc_sample_inventory.csv'), index=False)
print(f'Saved: data/nmdc_sample_inventory.csv ({len(inventory)} rows)')
print('Columns:', inventory.columns.tolist())

In [None]:
# Save metabolomics coverage stats for overlap files
met_overlap = met_file_stats[met_file_stats['file_id'].isin(overlap_files)].copy()
met_overlap.to_csv(os.path.join(DATA_DIR, 'nmdc_metabolomics_coverage.csv'), index=False)
print(f'Saved: data/nmdc_metabolomics_coverage.csv ({len(met_overlap)} rows)')

# Save classifier comparison summary
if clf_stats:
    clf_summary.to_csv(os.path.join(DATA_DIR, 'nmdc_taxonomy_coverage.csv'), index=False)
    print(f'Saved: data/nmdc_taxonomy_coverage.csv ({len(clf_summary)} rows)')

In [None]:
# Figure: Sample counts by data type and metabolomics compound distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
fig.suptitle('NMDC Sample Coverage', fontsize=14)

# Panel 1: File counts by data availability
ax = axes[0]
counts = {
    'Taxonomy\n(kraken) only': n_kraken_files - len(overlap_files),
    'Metabolomics\nonly': n_met_files - len(overlap_files),
    'Both\n(overlap)': len(overlap_files)
}
bars = ax.bar(list(counts.keys()), list(counts.values()),
              color=['#4C9BE8', '#E88C4C', '#6EC46E'], edgecolor='white')
ax.set_ylabel('Number of files')
ax.set_title('File data availability\n(kraken classifier)')
for bar, val in zip(bars, counts.values()):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 1,
            str(val), ha='center', va='bottom', fontsize=10)

# Panel 2: Metabolomics feature count distribution for overlap files
ax2 = axes[1]
if len(met_overlap) > 0:
    ax2.hist(met_overlap['n_features_total'], bins=30, color='#6EC46E', edgecolor='white')
    ax2.set_xlabel('Features per file')
    ax2.set_ylabel('Number of files')
    ax2.set_title('Metabolomics feature counts\n(overlap files)')
    median_val = met_overlap['n_features_total'].median()
    ax2.axvline(median_val, color='black', linestyle='--',
                label=f'Median: {median_val:.0f}')
    ax2.legend()
else:
    ax2.text(0.5, 0.5, 'No overlap files found', ha='center', va='center',
             transform=ax2.transAxes)

plt.tight_layout()
fig_path = os.path.join(FIGURES_DIR, 'nmdc_sample_coverage.png')
plt.savefig(fig_path, dpi=150, bbox_inches='tight')
plt.show()
print('Saved: figures/nmdc_sample_coverage.png')

---
## Summary and Decisions for NB02

Fill in after running all cells:

| Question | Finding |
|---|---|
| Files with taxonomy (kraken) | ??? |
| Files with metabolomics | ??? |
| Overlap (both) | ??? |
| Best taxonomy classifier | ??? (kraken / centrifuge / gottcha) |
| Taxonomy resolution | Species-level available (`rank = 'species'` confirmed) |
| Taxonomy abundance column | `abundance` (float — likely pre-normalized; verify) |
| Metabolomics compound name column | `name` (string) ✓ confirmed |
| Metabolomics KEGG ID column | `kegg` (string) ✓ confirmed |
| Metabolomics ChEBI ID column | `chebi` (double) ✓ confirmed |
| Metabolomics abundance column | `Area` or `Intensity` — inspect cell-21 output |
| KEGG annotation rate | ???% |
| ChEBI annotation rate | ???% |
| Amino acid compounds found? | ??? compound names matching AA list |
| File → Study join strategy | file_name parsing OR separate lookup table (verify in cell-13) |
| `taxonomy_features` structure | Wide matrix — columns are NCBI taxon IDs; investigate row count vs 6,365 |
| `abiotic_features` ID column | `sample_id` OR `file_id` (resolved at runtime in cell-14) |

**Confirmed column names for NB02**:
- Classifiers: `file_id`, `rank`, `name` (kraken) or `label` (centrifuge/gottcha), `abundance`
- Metabolomics: `file_id`, `name`, `kegg`, `chebi`, `Area`

**Decision for NB02**: Use `___` classifier. Bridge at species rank (`rank = 'species'`).
Use `name` column (kraken) or `label` column (centrifuge/gottcha) for GTDB matching.
Expected bridge coverage: ???%.
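
One possible way to put a number on bridge coverage is sketched below. The definition is an assumption, not something fixed by this notebook: coverage is taken as the per-file fraction of species-rank abundance whose taxon name appears in a reference name set. `bridge_coverage` is a hypothetical helper and the data are toy values.

```python
import pandas as pd


def bridge_coverage(
    species_df: pd.DataFrame, reference_names: set[str], name_col: str = "name"
) -> pd.Series:
    """Per-file fraction of species-rank abundance with a name in the reference set."""
    matched = species_df[name_col].isin(reference_names)
    # Zero out unmatched rows, then sum abundance per file.
    matched_abund = (
        species_df["abundance"].where(matched, 0.0).groupby(species_df["file_id"]).sum()
    )
    total_abund = species_df.groupby("file_id")["abundance"].sum()
    return (matched_abund / total_abund).rename("bridge_coverage")


toy = pd.DataFrame(
    {
        "file_id": ["f1", "f1", "f1"],
        "name": ["Escherichia coli", "Bacillus subtilis", "Unplaced species"],
        "abundance": [0.5, 0.25, 0.25],
    }
)
print(bridge_coverage(toy, {"Escherichia coli", "Bacillus subtilis"}))
```

Exact string matching is the weakest part of this sketch. Classifier names and GTDB names can differ (GTDB taxonomy strings carry rank prefixes such as `s__`), so some normalization step would probably be needed first. For centrifuge or gottcha, pass `name_col='label'`.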